test(ika-upgrade-test): out-of-process cross-binary upgrade harness #1727
Open
ycscaly wants to merge 183 commits into
Foundation for the off-chain validator-metadata read flow. Pure types and no-op consensus dispatch: no behavior change, so the acceptance gate `test_network_dkg_full_flow` still passes.
New types in `ika_types::validator_metadata`:
- ValidatorMpcDataAnnouncement / SignedValidatorMpcDataAnnouncement
- HandoffItemKey (sorted enum: NetworkDkgOutput | NetworkReconfigurationOutput | ValidatorMpcData)
- HandoffAttestation with `items: Vec<(HandoffItemKey, [u8;32])>` sorted strictly ascending; this is a plain length-prefixed BCS list, so no map-aware bindings are needed for non-Rust verifiers
- HandoffSignatureMessage (Ed25519 sig by consensus key, NOT protocol key)
- CertifiedHandoffAttestation (Vec<(AuthorityName, Ed25519Signature)>; Ed25519 doesn't aggregate)
- EpochMpcDataReadySignal
IntentScope: +ValidatorMpcDataAnnouncement, +HandoffAttestation.
ConsensusTransactionKind + Key: 3 new variants + constructors + key extraction + Debug arms.
AuthorityPerEpochStore / consensus_handler / consensus_validator wire dispatch as no-ops (actual handlers land in later steps); the per-epoch sender-author match enforces wire-binding for HandoffSignature and EpochMpcDataReadySignal (signer == consensus author), and is a trivial pass for ValidatorMpcDataAnnouncement (the inner BLS sig authenticates the validator's intent independent of the relayer).
Unit tests cover BCS roundtrip + sort stability + ready-signal roundtrip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anemo `ValidatorMetadata` service with one method `GetMpcDataBlob(blob_hash) -> Option<MpcDataBlob>`. Backed by an `InMemoryBlobStore` (RwLock<HashMap<[u8;32], Vec<u8>>>) implementing `MpcDataBlobStorage`. Callers hash-verify returned bytes; the network layer doesn't, and the doc comment on `fetch_blob` says so.
`AuthorityPerpetualTables::mpc_artifact_blobs: DBMap<[u8;32], Vec<u8>>`, with insert / get / iter helpers, is the cross-restart store. At node startup `create_p2p_network` iterates that table and hydrates the in-memory cache before mounting the anemo server, so a restart keeps serving whatever blobs the validator had persisted.
No producers or consumers wire up yet; those land in subsequent steps. The endpoint just serves whatever's been inserted (initially nothing on a fresh node).
Acceptance gate `test_network_dkg_full_flow` passes (142s). 2 new unit tests in ika-network (`in_memory_blob_store_roundtrip`, `mpc_data_blob_hash_is_deterministic`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
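The split between "the store serves bytes" and "the caller verifies the hash" can be sketched as below. This is a minimal illustration using only the standard library, not the code from this PR: `toy_hash` is a non-cryptographic stand-in for the real digest, and `verify_fetched` is a hypothetical helper name.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// 32-byte content digest used as the blob key.
type BlobHash = [u8; 32];

/// Stand-in digest (NOT cryptographic): folds the bytes into 32 lanes.
/// A real store would use a proper hash function here.
fn toy_hash(bytes: &[u8]) -> BlobHash {
    let mut out = [0u8; 32];
    for (i, b) in bytes.iter().enumerate() {
        out[i % 32] ^= b.wrapping_add(i as u8);
    }
    out
}

/// In-memory blob cache keyed by content hash.
#[derive(Default)]
struct InMemoryBlobStore {
    blobs: RwLock<HashMap<BlobHash, Vec<u8>>>,
}

impl InMemoryBlobStore {
    /// Stores `bytes` under their own hash and returns that hash.
    fn insert(&self, bytes: Vec<u8>) -> BlobHash {
        let hash = toy_hash(&bytes);
        self.blobs.write().unwrap().insert(hash, bytes);
        hash
    }

    /// Serves whatever is stored under `hash`; performs no verification.
    fn get(&self, hash: &BlobHash) -> Option<Vec<u8>> {
        self.blobs.read().unwrap().get(hash).cloned()
    }
}

/// Caller-side check: accept returned bytes only if they hash to the key.
fn verify_fetched(expected: &BlobHash, returned: Option<Vec<u8>>) -> Option<Vec<u8>> {
    returned.filter(|bytes| &toy_hash(bytes) == expected)
}
```

Keeping verification on the caller side means a misbehaving peer can at worst return bytes that get discarded.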
Producer side (ika_core::validator_metadata):
- derive_mpc_data_blob(seed) returns the canonical BCS-encoded VersionedMPCData::V1 bytes, the same encoding the CLI submits on chain via set_next_epoch_mpc_data_bytes. Deterministic from seed, so off-chain blobs hash-match chain bytes.
- now_ms() for the announcement timestamp (latest-by-timestamp rule means later calls win, which is correct after a seed rotation).
- sign_validator_mpc_data_announcement(...) builds + BLS-signs the announcement ready for consensus.
Consumer side (AuthorityPerEpochStore):
- New per-epoch table validator_mpc_data_announcements: DBMap<AuthorityName, SignedValidatorMpcDataAnnouncement>.
- record_validator_mpc_data_announcement verifies the BLS sig against self.committee() (current-epoch path only; next-epoch joiner path deferred to step 6) and applies the latest-by-timestamp rule on insert. Replays and stale duplicates are silently dropped.
- get_validator_mpc_data_announcement accessor.
- Consensus dispatch wires the ConsensusTransactionKind::ValidatorMpcDataAnnouncement variant through.
Unit tests in ika-core::validator_metadata:
- derive_mpc_data_blob_is_deterministic
- sign_announcement_verifies_against_signer (covers intent scope + epoch binding + tamper detection).
Acceptance gate test_network_dkg_full_flow still passes (143s). No producers wired up yet; they land in subsequent steps along with the ready-signal freeze.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
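The latest-by-timestamp insert rule can be illustrated with a small sketch. The types below are simplified stand-ins (a `String` authority, no signature), and treating an equal timestamp as a replay is an assumption of this example rather than something the commit states.

```rust
use std::collections::HashMap;

/// Simplified announcement; the real type also carries a BLS signature.
#[derive(Clone, Debug, PartialEq)]
struct Announcement {
    authority: String,
    blob_hash: [u8; 32],
    timestamp_ms: u64,
}

/// Hypothetical in-memory stand-in for the per-epoch announcements table.
#[derive(Default)]
struct AnnouncementTable {
    by_authority: HashMap<String, Announcement>,
}

impl AnnouncementTable {
    /// Returns true if `incoming` was stored, false if it was dropped as a
    /// replay or stale duplicate (timestamp not strictly newer).
    fn record(&mut self, incoming: Announcement) -> bool {
        let is_newer = self
            .by_authority
            .get(&incoming.authority)
            .map_or(true, |existing| incoming.timestamp_ms > existing.timestamp_ms);
        if is_newer {
            self.by_authority.insert(incoming.authority.clone(), incoming);
        }
        is_newer
    }
}
```

Signature verification is deliberately left out; in the flow described above it would run before this rule is applied.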
Adds two new epoch tables and a producer helper for the freeze step of the off-chain validator-metadata flow.
`epoch_mpc_data_ready_signals` records, per authority, that this validator has decided its mpc_data input set is sufficient (`>= quorum_threshold` announcements observed). The first incoming signal that crosses quorum triggers `freeze_mpc_data_if_first`, which idempotently snapshots `validator_mpc_data_announcements` into `frozen_validator_mpc_data_input_set`: the immutable, content-addressed view of validator mpc_data used by all downstream consumers (handoff, reconfig, joiner bootstrap).
The signal payload itself is unauthenticated; authorisation is the consensus binding (the authority that submitted the transaction). This is enforced at consensus dispatch in `AuthorityPerEpochStore`.
Producer side: `build_epoch_mpc_data_ready_signal_transaction` wraps the signal in a `ConsensusTransaction` ready for the consensus adapter.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 142.28s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
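A rough sketch of the freeze-on-first-quorum behaviour follows. `EpochState`, its fields, and the stake-weighted reading of "crosses quorum" are assumptions made for illustration; the real state lives in per-epoch DB tables.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory model of the per-epoch freeze state.
struct EpochState {
    stake: BTreeMap<String, u64>,               // committee voting power
    quorum_threshold: u64,
    announcements: BTreeMap<String, [u8; 32]>,  // live announcement digests
    ready_signals: BTreeSet<String>,            // authorities that signaled
    frozen: Option<BTreeMap<String, [u8; 32]>>, // immutable snapshot
}

impl EpochState {
    fn new(stake: BTreeMap<String, u64>, quorum_threshold: u64) -> Self {
        EpochState {
            stake,
            quorum_threshold,
            announcements: BTreeMap::new(),
            ready_signals: BTreeSet::new(),
            frozen: None,
        }
    }

    /// `consensus_author` is whoever submitted the transaction; the payload
    /// itself carries no signature in this model.
    fn record_ready_signal(&mut self, consensus_author: &str) {
        if !self.stake.contains_key(consensus_author) {
            return; // not a committee member
        }
        self.ready_signals.insert(consensus_author.to_string());
        let signaled: u64 = self.ready_signals.iter().map(|a| self.stake[a]).sum();
        if signaled >= self.quorum_threshold {
            self.freeze_if_first();
        }
    }

    /// Idempotent: only the first call takes a snapshot.
    fn freeze_if_first(&mut self) {
        if self.frozen.is_none() {
            self.frozen = Some(self.announcements.clone());
        }
    }
}
```

Announcements that arrive after the snapshot still land in the live table but never change `frozen`.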
Joining validators (in V_{e+1} but not in V_e) can't submit
directly to consensus because they aren't members of the current
consensus committee. They fan out their signed mpc_data
announcement to every current-committee peer over a new Anemo RPC
`SubmitMpcDataAnnouncement`; one honest relayer is enough to land
the announcement in consensus.
This commit lands the transport only:
- `SubmitMpcDataAnnouncementRequest` / `SubmitMpcDataAnnouncementResponse` wire types.
- `AnnouncementRelay` trait (impl supplied by the node once epoch
store + consensus adapter are up).
- `AnnouncementRelayHandle` — an `ArcSwapOption` late-binding
holder, installed at first epoch start and re-installed across
epoch boundaries. The Anemo server is constructed at node
startup before any epoch store exists, so install-after-the-fact
is needed.
- Anemo server impl that returns `Rejected` while the relay is
uninstalled (joiners retry) and dispatches to the active relay
otherwise.
- Client helpers: `submit_announcement_to_peer` (single peer) and
`submit_announcement_to_committee` (concurrent fan-out).
Installation of the actual relay impl (which performs signature
verification against the pending active set) is deferred to the
PendingActiveSet step, since the relay needs that verification
before it can safely submit.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.61s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
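The late-binding pattern described above (server built before any relay exists, relay installed later) might look roughly like this. The sketch substitutes `RwLock<Option<Arc<..>>>` from the standard library for `ArcSwapOption`; the `Accepted` variant, the `relay` method name, and the `AcceptAll` relay are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Simplified response; only `Rejected` is named in the commit message.
#[derive(Debug, PartialEq)]
enum SubmitResponse {
    Accepted,
    Rejected,
}

/// Implemented by the node once epoch store + consensus adapter are up.
trait AnnouncementRelay: Send + Sync {
    fn relay(&self, signed_announcement: &[u8]) -> SubmitResponse;
}

/// Late-binding holder: constructed empty, filled in at first epoch start.
#[derive(Default)]
struct AnnouncementRelayHandle {
    inner: RwLock<Option<Arc<dyn AnnouncementRelay>>>,
}

impl AnnouncementRelayHandle {
    /// Installs (or re-installs across an epoch boundary) the active relay.
    fn install(&self, relay: Arc<dyn AnnouncementRelay>) {
        *self.inner.write().unwrap() = Some(relay);
    }

    /// Server-side handler: reject while no relay is installed so the
    /// joiner retries; otherwise dispatch to the active relay.
    fn handle_submit(&self, signed_announcement: &[u8]) -> SubmitResponse {
        let relay = self.inner.read().unwrap().clone();
        match relay {
            Some(r) => r.relay(signed_announcement),
            None => SubmitResponse::Rejected,
        }
    }
}

/// Trivial relay used only to exercise the handle.
struct AcceptAll;

impl AnnouncementRelay for AcceptAll {
    fn relay(&self, _signed_announcement: &[u8]) -> SubmitResponse {
        SubmitResponse::Accepted
    }
}
```

Cloning the `Arc` out of the lock before dispatching keeps the lock from being held during the relay call.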
Replaces the placeholder next-epoch branch in `record_validator_mpc_data_announcement` with real signature verification gated on a `JoinerPubkeyProvider`.
`JoinerPubkeyProvider::is_registered_joiner(&AuthorityName) -> bool` is the trait the Sui-backed lookup will implement; a future step populates it from `validator_set.pending_active_set` plus each entry's `StakingPool.validator_info`'s next-epoch pubkey. Until that lands, `joiner_pubkey_provider` is unset and all next-epoch announcements drop; the current-epoch flow is unchanged.
`verify_joiner_announcement` is a pure helper (caller passes `expected_epoch` and the provider). The per-epoch-store method calls it and reacts to the four-way verdict (Accept/UnregisteredJoiner/InvalidSignature/InconsistentEnvelope); only `Accept` proceeds to the latest-by-timestamp insert rule.
The provider is held in an `ArcSwapOption` on `AuthorityPerEpochStore`, swappable across epoch boundaries via `install_joiner_pubkey_provider` / `clear_joiner_pubkey_provider`.
`AuthorityName == AuthorityPublicKeyBytes`, so the verifier uses `signed.auth_sig.authority` as the pubkey directly; the provider only authorizes *which* names are joinable.
Tests cover Accept, UnregisteredJoiner, InvalidSignature (tampered blob hash), InconsistentEnvelope (wrong epoch + authority field mismatch), and `StaticJoinerPubkeyProvider` membership semantics.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 148.28s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
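The four-way verdict can be sketched as a pure function. Everything here is simplified for illustration: authorities are `String`s, the BLS check is reduced to a precomputed `sig_valid` flag, and the order of the checks is an assumption of this example.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum JoinerVerdict {
    Accept,
    UnregisteredJoiner,
    InvalidSignature,
    InconsistentEnvelope,
}

/// Simplified signed announcement (hypothetical field names).
struct SignedAnnouncement {
    announcer: String,     // authority named inside the announcement
    epoch: u64,            // epoch the announcement claims
    sig_authority: String, // authority named on the signature envelope
    sig_valid: bool,       // stand-in for the real BLS verification result
}

trait JoinerPubkeyProvider {
    fn is_registered_joiner(&self, name: &str) -> bool;
}

struct StaticJoinerPubkeyProvider(HashSet<String>);

impl JoinerPubkeyProvider for StaticJoinerPubkeyProvider {
    fn is_registered_joiner(&self, name: &str) -> bool {
        self.0.contains(name)
    }
}

/// Pure helper: the caller supplies the expected epoch and the provider.
fn verify_joiner_announcement(
    signed: &SignedAnnouncement,
    expected_epoch: u64,
    provider: &dyn JoinerPubkeyProvider,
) -> JoinerVerdict {
    if signed.epoch != expected_epoch || signed.announcer != signed.sig_authority {
        return JoinerVerdict::InconsistentEnvelope;
    }
    if !provider.is_registered_joiner(&signed.announcer) {
        return JoinerVerdict::UnregisteredJoiner;
    }
    if !signed.sig_valid {
        return JoinerVerdict::InvalidSignature;
    }
    JoinerVerdict::Accept
}
```

Only an `Accept` verdict would go on to the latest-by-timestamp insert.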
Lands the canonical, off-chain handoff attestation primitives behind the next-step record/persist plumbing. These are the building blocks each validator runs locally at EndOfPublish (builder + signer) and that every validator runs on incoming consensus signatures (verifier + aggregator).
- `build_handoff_attestation`: sorts items strictly ascending by `HandoffItemKey` (the wire format is a Vec, not a map, so the sort defines the canonical bytes every signer commits to); rejects duplicate keys.
- `hash_next_committee_pubkey_set`: dedup + sort + BCS-encode + Blake2b256 over the next committee's pubkey set. This goes in the attestation header, so verifiers can confirm the cert is bound to the committee they're handing off to.
- `sign_handoff_attestation`: Ed25519 over `bcs(IntentMessage::new(HandoffAttestation, attestation))`, signed with the validator's *consensus* key, NOT BLS. (Joiners look up signers' consensus pubkeys in the prior committee's on-chain validator info.)
- `ConsensusPubkeyProvider` trait + `StaticConsensusPubkeyProvider` for the consensus-pubkey lookup, mirroring the joiner-provider shape from step 6.
- `verify_handoff_signature` returns a four-way verdict (Accept/UnknownSigner/InvalidSignature/AttestationMismatch).
- `HandoffAggregator`: one-shot stake-weighted aggregator that emits `CertifiedHandoffAttestation` the first time signers cross `committee.quorum_threshold()`. Replacements don't double-count; non-committee signers are silently dropped (the consensus path also rejects them at the dispatch site, but the aggregator is defense-in-depth).
- `verify_certified_handoff_attestation`: standalone re-verify against a committee + provider; this is what joiners run during bootstrap on the cert they fetched.
Tests cover sort canonicalization, duplicate-key rejection, pubkey-set hash invariance under reorder and dedup, sign+verify round trip with the four verdict outcomes, aggregator quorum crossing, replacement no-op, non-committee signer no-op, and end-to-end certify-then-re-verify-with-tampered-sig.
Record / persist / EndOfPublish-trigger wiring land in follow-on commits; these helpers are isolated and consumed at those sites.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 143.26s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
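Because the wire format is a list rather than a map, the sort is what makes two signers agree on the bytes. A minimal sketch of that canonicalization, with hypothetical payload types on the `HandoffItemKey` variants and hypothetical helper names (`canonical_items`, `canonical_pubkey_set`):

```rust
/// Variant order follows the enum as listed in the first commit; the
/// payload types are stand-ins chosen for this example.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum HandoffItemKey {
    NetworkDkgOutput(u64),
    NetworkReconfigurationOutput(u64),
    ValidatorMpcData(String),
}

type Digest = [u8; 32];

/// Error carrying the key that appeared more than once.
#[derive(Debug, PartialEq)]
struct DuplicateKey(HandoffItemKey);

/// Sorts items strictly ascending by key and rejects duplicates, so any
/// input order yields the same list.
fn canonical_items(
    mut items: Vec<(HandoffItemKey, Digest)>,
) -> Result<Vec<(HandoffItemKey, Digest)>, DuplicateKey> {
    items.sort_by(|a, b| a.0.cmp(&b.0));
    for pair in items.windows(2) {
        if pair[0].0 == pair[1].0 {
            return Err(DuplicateKey(pair[0].0.clone()));
        }
    }
    Ok(items)
}

/// Sort + dedup of the next committee's pubkeys, i.e. the step that would
/// precede encoding and hashing.
fn canonical_pubkey_set(mut keys: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    keys.sort();
    keys.dedup();
    keys
}
```

Encoding and hashing are omitted; the point is only that both helpers are insensitive to input order.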
Wires the consensus dispatch path for `HandoffSignature` to verify, persist, and aggregate incoming Ed25519 signatures over the epoch's handoff attestation.
Per-epoch state on `AuthorityPerEpochStore`:
- `handoff_signatures: DBMap<AuthorityName, Ed25519Signature>`: durable record of each verified signer's sig. Replays are no-ops via typed-store insert semantics.
- `expected_handoff_attestation: ArcSwapOption<HandoffAttestation>`: this validator's locally-computed attestation, installed by the producer side once mpc_data is frozen + DKG/reconfig digests are known. Until installed, incoming signatures drop silently (`AttestationMismatch` is the only possible verdict).
- `consensus_pubkey_provider: ArcSwapOption<...>`: Ed25519 lookup for signer pubkeys, populated by the same sui_syncer task that feeds the joiner provider.
- `handoff_aggregator: Mutex<Option<HandoffAggregator>>`: in-memory stake accumulator. Rebuilt from persisted signatures when the expected attestation is (re)installed, so restart replay folds prior consensus-ordered signatures back in correctly.
New pure helper in `validator_metadata`:
- `process_handoff_signature` runs `verify_handoff_signature` and, on `Accept`, inserts into the aggregator. Returns one of `Recorded`, `Certified(cert)`, or `Rejected(verdict)`. Three new unit tests cover quorum-crossing, attestation mismatch, and unknown-signer paths.
`PartialEq`/`Eq` added to `HandoffSignatureMessage` and `CertifiedHandoffAttestation` so the record-outcome enum can derive those traits for tests.
Consensus dispatch: the `HandoffSignature` arm now calls `record_handoff_signature`. The returned cert (when quorum just crossed) is intentionally dropped on the floor for now; the perpetual-persist plumbing (step 7c) hangs off a dedicated drain task that pulls from the in-memory aggregator. Dropping is safe because the *next* ordered signature crossing quorum still mints a cert, and restart-replay rebuilds the aggregator.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 142.08s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
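A one-shot, stake-weighted aggregator of the kind described in this series can be sketched as follows. The sketch assumes signatures were already verified, treats a second signature from the same signer as a no-op, and uses a hypothetical `Ignored` outcome for non-committee signers; the real helper returns `Recorded`, `Certified(cert)`, or `Rejected(verdict)`.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Outcome {
    Recorded,
    /// Emitted once, the first time quorum is crossed.
    Certified(Vec<(String, Vec<u8>)>),
    /// Signer is not in the committee (hypothetical variant).
    Ignored,
}

struct HandoffAggregator {
    stake: BTreeMap<String, u64>,
    quorum_threshold: u64,
    signatures: BTreeMap<String, Vec<u8>>,
    certified: bool,
}

impl HandoffAggregator {
    fn new(stake: BTreeMap<String, u64>, quorum_threshold: u64) -> Self {
        HandoffAggregator { stake, quorum_threshold, signatures: BTreeMap::new(), certified: false }
    }

    /// Inserts an already-verified signature and reports the outcome.
    fn insert(&mut self, signer: &str, sig: Vec<u8>) -> Outcome {
        if !self.stake.contains_key(signer) {
            return Outcome::Ignored;
        }
        if self.signatures.contains_key(signer) {
            return Outcome::Recorded; // replacement: stake is not counted twice
        }
        self.signatures.insert(signer.to_string(), sig);
        let total: u64 = self.signatures.keys().map(|s| self.stake[s]).sum();
        if !self.certified && total >= self.quorum_threshold {
            self.certified = true;
            let sigs = self.signatures.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
            return Outcome::Certified(sigs);
        }
        Outcome::Recorded
    }
}
```

Rebuilding after a restart would amount to replaying the persisted signatures through `insert` in order.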
Closes the handoff write path: once `record_handoff_signature`'s in-memory aggregator crosses quorum, the resulting `CertifiedHandoffAttestation` is immediately persisted into a keep-forever perpetual table.
`AuthorityPerpetualTables`:
- New `certified_handoff_attestations: DBMap<EpochId, CertifiedHandoffAttestation>` table, keyed by the epoch the outgoing committee is handing off *from*.
- `insert_certified_handoff_attestation`, `get_certified_handoff_attestation`, `iter_certified_handoff_attestations` accessors.
The handoff feedback rule (keep certs forever) is load-bearing because a joiner pulling history may need to verify the chain back to whichever cert it has a trusted committee for; skipping any single epoch's cert would permanently break their ability to bootstrap.
`AuthorityPerEpochStore` gains `perpetual_tables_for_handoff: ArcSwapOption<...>` plus `install_perpetual_tables_for_handoff`. `ika-node` installs the perpetual handle directly after constructing the epoch store, so the very first cert produced by consensus lands on disk. When nothing is installed (e.g. unit tests that don't wire perpetual), the record path logs at debug level and keeps going; the cert stays in the in-memory aggregator and joiner-bootstrap consumers will simply miss it.
The `Certified` arm of `record_handoff_signature` now also performs the perpetual write, with the persist failure logged (not propagated), since failing the entire consensus-dispatch path on a perpetual-DB hiccup would be far worse than a missing cert.
Tests: 3 new perpetual-table unit tests cover insert/get roundtrip, ordered iteration across epochs, and byte-level idempotency on identical re-writes.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 141.68s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the producer half of the handoff loop: when this validator reaches EndOfPublish, the same task that submits its `EndOfPublish` consensus transaction also builds, installs, signs, and submits its `HandoffSignatureMessage` for the epoch, exactly once.
The trigger pipeline:
1. `compute_handoff_items` (pure): combines frozen mpc_data set + per-network-key DKG output digests + per-network-key reconfig output digests into a sorted Vec<(HandoffItemKey, [u8;32])>. Empty inputs are valid (yields an empty list), which is important because DKG/reconfig digest caching is step 9, and the attestation needs to be signable before then.
2. `AuthorityPerEpochStore::build_local_handoff_attestation`: reads the frozen set, hashes the supplied next-committee pubkey set, calls compute_handoff_items, and builds a well-formed attestation.
3. `AuthorityPerEpochStore::build_local_handoff_signature_transaction`: installs the attestation locally (so the per-epoch record path accepts matching peer signatures), signs it with the consensus key, and wraps it in a `ConsensusTransaction`.
4. `EndOfPublishSender` is upgraded to take the consensus keypair (Arc) + a `Receiver<Committee>` for the next epoch, plus an `AtomicBool` one-shot flag. The handoff submit happens after the EndOfPublish submit on the same tick.
Determinism across validators: identical inputs → identical attestation bytes → matching signatures. The frozen set is already agreed (step 4's quorum freeze); the next-committee pubkey set is read from chain. Until step 9 populates DKG/reconfig digests, every validator computes an attestation with those slots empty, which is still agreed.
The handoff record path (step 7b) was already wired to consume these signatures, and the perpetual persist (step 7c) writes the cert as soon as quorum is reached. With this commit, the cycle runs end-to-end given an actual EndOfPublish trigger.
Tests: 2 new unit tests cover `compute_handoff_items` sorting + empty-input semantics, in addition to the existing 19 helpers tests.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 144.29s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the read side that closes the handoff loop: peers can pull a
`CertifiedHandoffAttestation` for any persisted epoch over a new
`ValidatorMetadata::GetCertifiedHandoffAttestation` RPC, and joiners
have a single-hop verification helper that binds the cert to the
specific committee they're trying to join.
Network layer:
- New `GetCertifiedHandoffAttestationRequest { epoch }` wire type.
- New `HandoffCertStorage` trait — the read-only counterpart to
the perpetual store. Server holds an `Arc<C: HandoffCertStorage>`
alongside the existing blob store.
- `ValidatorMetadataServer` is now `Server<S, C>`; the
`build_server(storage, relay, cert_storage)` signature gained the
`cert_storage` arg.
- Joiner-side `fetch_certified_handoff_attestation(network, peer,
epoch)` mirrors the existing `fetch_blob`.
Adapter:
- `AuthorityPerpetualTables` implements `HandoffCertStorage` by
delegating to `get_certified_handoff_attestation`; a perpetual-read
error is logged (not propagated) and returned as `None`. The Anemo
hot path can't surface a typed error usefully.
ika-node:
- The perpetual handle is now passed into `build_server` so peers
immediately see every cert that lands on disk (via step 7c's
perpetual persist). No additional installation needed because
`AuthorityPerpetualTables` is constructed eagerly at startup.
Joiner bootstrap helper in `ika-core::validator_metadata`:
- `verify_joiner_bootstrap_cert(cert, prior_committee,
prior_consensus_pubkeys, expected_next_committee_pubkeys)` runs the
full check: pubkey-set-hash binding (so a malicious peer can't
hand a real cert for a different committee), then delegates to
the existing `verify_certified_handoff_attestation` for the
signature/stake check. One-hop only — joiners verify against
the *prior* committee, not back to genesis. (Per handoff design
memo: anchoring trust to the prior committee is sufficient since
the joiner gets there through earlier hops they either already
trust or are themselves bootstrapping from a known anchor.)
Tests: 1 new unit test exercising both the happy path and the
pubkey-set-mismatch refusal.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.31s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Populates the producer-side caches that feed the handoff attestation's `NetworkDkgOutput` / `NetworkReconfigurationOutput` items.
`AuthorityPerEpochStoreTrait` gains two methods, called from the MPC producer at the exact point it builds the consensus output:
- `cache_network_dkg_output(key_id, output_bytes)`
- `cache_network_reconfiguration_output(key_id, output_bytes)`
Concrete `AuthorityPerEpochStore` impl:
- Hashes `output_bytes` to Blake2b256 (matching `mpc_data_blob_hash`'s function so peers can fetch this blob over the existing `GetMpcDataBlob` RPC).
- Writes the digest into one of two new per-epoch tables, `network_dkg_output_digests` or `network_reconfiguration_output_digests`, keyed by `dwallet_network_encryption_key_id`.
- Writes the blob bytes into perpetual `mpc_artifact_blobs` (if the perpetual handle is installed) so cross-restart serves work for free.
- All writes are idempotent on byte-identical replays.
`build_local_handoff_attestation` no longer takes the digest maps as parameters; it reads them straight off the per-epoch store. `EndOfPublishSender::send_handoff_signature` is updated to match.
Producer hook: `DWalletMPCService::new_dwallet_mpc_output`'s User/System branch calls the trait methods for the DKG and reconfig protocols (`!rejected` only, since rejected outputs are empty and shouldn't pollute the cache). Cache failures are logged, not propagated; they don't fail the consensus output emit, just degrade peer serveability.
`TestingAuthorityPerEpochStore` gets no-op impls; the integration test gate doesn't exercise attestation contents so an in-memory mirror isn't needed.
Tests: 2 new unit tests cover the per-epoch table semantics: digest roundtrip + replay idempotency, and independence of the DKG vs reconfig caches when keyed by the same key_id.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 141.54s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the per-network-key counterpart to `EpochMpcDataReadySignal`.
Validators can now signal readiness for a specific network key's
DKG (`NetworkKeyDKGReadySignal { authority, network_key_id,
epoch }`) earlier than the epoch-wide signal, because per-key
readiness is a narrower commitment — the validator only needs the
mpc_data required for *this* key, not all reconfig sessions.
Per-epoch state:
- `network_key_dkg_ready_signals: DBMap<(ObjectID, AuthorityName),
()>` — per-key, per-authority votes. Composite key keeps quorums
scoped: the same authority signaling readiness for two keys
produces two independent entries.
Record path:
- `record_network_key_dkg_ready_signal` is idempotent on replays.
Quorum is per-key (sum stake of all authorities that signaled
for `signal.network_key_id`). The first quorum of *any* signal
kind — epoch-wide or per-key — calls `freeze_mpc_data_if_first`,
which is already idempotent on a non-empty frozen set. Per-key
quorums after that point are still recorded (DKG kickoff per key
consumes them) but don't re-freeze.
- `has_network_key_dkg_ready_quorum(network_key_id)` exposes the
per-key quorum state for step 14's session-kickoff gating.
Consensus wiring:
- New `ConsensusTransactionKind::NetworkKeyDKGReadySignal` +
matching `ConsensusTransactionKey` variant.
- `new_network_key_dkg_ready_signal` constructor.
- Sender-authority check at verification time (consensus binding
is the only authentication; no payload signature).
- Metric label + validator pass-through arms.
Producer helper:
- `build_network_key_dkg_ready_signal_transaction(authority,
network_key_id, epoch)` wraps a signal in a
`ConsensusTransaction` ready for submission.
Tests: 1 new unit test on `AuthorityEpochTables`'s
`network_key_dkg_ready_signals` table covers composite-key
scoping + replay idempotency.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 142.54s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
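The composite-key scoping of per-key votes can be illustrated with a small in-memory model. `ReadySignals`, the `u64` stand-in for `ObjectID`, and the method names are assumptions for this sketch only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for the network key's ObjectID.
type KeyId = u64;

/// Hypothetical in-memory model of the per-key ready-signal table.
struct ReadySignals {
    stake: BTreeMap<String, u64>,
    quorum_threshold: u64,
    /// Composite key: one entry per (network key, authority) pair.
    votes: BTreeSet<(KeyId, String)>,
}

impl ReadySignals {
    /// Idempotent on replays: inserting the same pair twice changes nothing.
    fn record(&mut self, key_id: KeyId, authority: &str) {
        if self.stake.contains_key(authority) {
            self.votes.insert((key_id, authority.to_string()));
        }
    }

    /// Quorum is scoped to `key_id`: only votes for that key are summed.
    fn has_quorum(&self, key_id: KeyId) -> bool {
        let total: u64 = self
            .votes
            .iter()
            .filter(|(k, _)| *k == key_id)
            .map(|(_, a)| self.stake[a])
            .sum();
        total >= self.quorum_threshold
    }
}
```

One authority signaling for two keys produces two independent entries, so reaching quorum for one key says nothing about the other.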
Filters the frozen mpc_data input set down to the union of the current and next committees before it's consumed by handoff cert build (and, in step 14, reconfig MPC). Validators who announced mpc_data this epoch but withdrew before next_committee was selected get dropped: the cert no longer pins their entries and reconfig MPC won't allocate work for them.
`compute_effective_reconfig_input_set(frozen, current, next) -> BTreeMap<AuthorityName, [u8;32]>` is the pure helper; it intersects with the union of both committee membership lists. Both committee inputs are `IntoIterator` so callers can hand it whatever shape they already have (Vec, &[..], `voting_rights` iter).
`AuthorityPerEpochStore::get_effective_reconfig_input_set` reads the frozen set and the current committee from the store and delegates to the pure helper. `build_local_handoff_attestation` now goes through this method instead of pulling `frozen` raw, so cert items reflect the effective set.
Tests: 2 new unit tests cover the intersection semantics: a four-author scenario where staying members, joiners, and withdrawers each take their expected path through the filter, plus the degenerate case where no announcer overlaps the committees.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 143.88s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
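The intersection with the union of both committees can be written as a short pure function. This sketch uses `String` in place of `AuthorityName`, and the exact generic bounds are an assumption; only the name and the shape of the result come from the commit message.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Keeps only the frozen entries whose authority is in the current or the
/// next committee.
fn compute_effective_reconfig_input_set<C, N>(
    frozen: &BTreeMap<String, [u8; 32]>,
    current: C,
    next: N,
) -> BTreeMap<String, [u8; 32]>
where
    C: IntoIterator<Item = String>,
    N: IntoIterator<Item = String>,
{
    // Union of both membership lists.
    let members: BTreeSet<String> = current.into_iter().chain(next).collect();
    frozen
        .iter()
        .filter(|(name, _)| members.contains(*name))
        .map(|(name, digest)| (name.clone(), *digest))
        .collect()
}
```

An announcer that is in neither committee is dropped, while members that appear in only one of the two are kept.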
Adds the read-side abstraction that lets the sui_syncer prefer locally-cached protocol output blobs over the chain blobs when assembling `DWalletNetworkEncryptionKeyData`. The lightweight fields (id, current_epoch, dkg_at_epoch, state) always come from chain, since those are authoritative, but the large `network_dkg_public_output` and `current_reconfiguration_public_output` blobs can come from the local content-addressed cache populated by step 9's producer caching.
New in `ika-core::validator_metadata`:
- `NetworkKeyBlobSource` trait: `network_dkg_output_blob(key_id)` and `network_reconfiguration_output_blob(key_id)`, both returning `Option<Vec<u8>>`. `None` means "fall back to chain".
- `StaticNetworkKeyBlobSource`: empty-by-default in-memory impl, used by tests and as the typed-empty default.
- `fetch_network_key_data_with_off_chain_blobs(chain_data, source) -> DWalletNetworkEncryptionKeyData`: takes the chain copy, overlays each large blob from `source` if present.
`AuthorityPerEpochStore` implements `NetworkKeyBlobSource` by looking up the per-epoch digest cache from step 9 (`network_dkg_output_digests` / `network_reconfiguration_output_digests`) and then fetching the blob bytes from the perpetual `mpc_artifact_blobs` store. A missing digest *or* a missing blob returns `None`; every step in the chain has the chain fallback behind it.
Syncer wiring (replacing the chain-read in `sui_syncer::sync_dwallet_network_keys` with the wrapper) is the next commit; this one lays the infrastructure.
Tests: 2 new unit tests cover the overlay semantics: partial overlay (DKG from source, reconfig from chain) and the all-fall-back case where the source is empty and the merged data equals the chain copy byte-for-byte.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 142.76s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
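The overlay-with-fallback idea can be sketched like this. `NetworkKeyData` is a cut-down stand-in for `DWalletNetworkEncryptionKeyData`, key ids are plain `u64`s, and `overlay_off_chain_blobs` is a hypothetical name for the merge step.

```rust
use std::collections::HashMap;

/// Cut-down stand-in for the chain-side network key record.
#[derive(Clone, Debug, PartialEq)]
struct NetworkKeyData {
    id: u64,
    current_epoch: u64,
    network_dkg_public_output: Vec<u8>,
    current_reconfiguration_public_output: Vec<u8>,
}

/// `None` from either method means "fall back to the chain copy".
trait NetworkKeyBlobSource {
    fn network_dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
    fn network_reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>>;
}

/// Empty-by-default in-memory source.
#[derive(Default)]
struct StaticNetworkKeyBlobSource {
    dkg: HashMap<u64, Vec<u8>>,
    reconfig: HashMap<u64, Vec<u8>>,
}

impl NetworkKeyBlobSource for StaticNetworkKeyBlobSource {
    fn network_dkg_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.dkg.get(&key_id).cloned()
    }
    fn network_reconfiguration_output_blob(&self, key_id: u64) -> Option<Vec<u8>> {
        self.reconfig.get(&key_id).cloned()
    }
}

/// Lightweight fields always come from `chain`; each large blob is taken
/// from `source` when present and from `chain` otherwise.
fn overlay_off_chain_blobs(chain: NetworkKeyData, source: &dyn NetworkKeyBlobSource) -> NetworkKeyData {
    NetworkKeyData {
        id: chain.id,
        current_epoch: chain.current_epoch,
        network_dkg_public_output: source
            .network_dkg_output_blob(chain.id)
            .unwrap_or(chain.network_dkg_public_output),
        current_reconfiguration_public_output: source
            .network_reconfiguration_output_blob(chain.id)
            .unwrap_or(chain.current_reconfiguration_public_output),
    }
}
```

With an empty source the result equals the chain copy, which matches the all-fall-back case in the tests listed above.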
Adds the off-chain assembler for the load-bearing
`Committee.class_groups_public_keys_and_proofs` map — the
HashMap reconfig MPC reads to find each committee member's
class-groups encryption key + correctness proof. The new path
decodes blobs locally from the perpetual `mpc_artifact_blobs`
store, keyed by digests pinned in the validators'
`ValidatorMpcDataAnnouncement`s.
The completion gate (per the design memo) is strict:
`assemble_committee_class_groups_off_chain` returns
`OffChainClassGroupsAssembly::Complete(map)` *only* when every
supplied authority resolved successfully — blob found, BCS-
decoded to `VersionedMPCData`, inner bytes decoded to
`ClassGroupsEncryptionKeyAndProof`. Even one missing or
malformed entry forces `Incomplete { missing: [...] }`, and the
caller must fall back to the chain-read path.
Why strict: reconfig MPC reads
`Committee.class_groups_public_keys_and_proofs[authority]`
directly, and a missing/empty entry silently drops that
validator's share without aborting. The existing chain-read path
in `sui_syncer::new_committee` already has this footgun (a
`filter_map` that swallows decode errors per-validator); the
off-chain path *must not* repeat it. Hence: all-or-nothing.
Wiring `sui_syncer::new_committee` to try off-chain first and
fall back on `Incomplete` is the next commit; this commit lands
the pure assembler.
Tests: 3 new unit tests cover (a) the happy path — two seeded
blobs round-trip through `derive_mpc_data_blob` →
`mpc_data_blob_hash` → an in-memory store → assembly back into
the map; (b) missing-blob aborts with the missing authority
listed; (c) corrupt-blob (bytes don't decode as
`VersionedMPCData`) also aborts.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 143.26s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
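The all-or-nothing completion gate can be sketched as below. To stay within the standard library, UTF-8 decoding stands in for the BCS decoding of `VersionedMPCData`, authorities are `String`s, and `assemble_all_or_nothing` is a hypothetical name.

```rust
use std::collections::{BTreeMap, HashMap};

/// Result of the assembly: either every authority resolved, or none is used.
#[derive(Debug, PartialEq)]
enum Assembly {
    Complete(BTreeMap<String, String>),
    Incomplete { missing: Vec<String> },
}

/// `digests` maps each authority to the digest pinned in its announcement;
/// `blobs` is the content-addressed store.
fn assemble_all_or_nothing(
    digests: &BTreeMap<String, [u8; 32]>,
    blobs: &HashMap<[u8; 32], Vec<u8>>,
) -> Assembly {
    let mut decoded = BTreeMap::new();
    let mut missing = Vec::new();
    for (authority, digest) in digests {
        // A missing blob and a blob that fails to decode are treated alike.
        let key = blobs
            .get(digest)
            .and_then(|bytes| String::from_utf8(bytes.clone()).ok());
        match key {
            Some(k) => {
                decoded.insert(authority.clone(), k);
            }
            None => missing.push(authority.clone()),
        }
    }
    if missing.is_empty() {
        Assembly::Complete(decoded)
    } else {
        // Never hand back a partial map: the caller must fall back.
        Assembly::Incomplete { missing }
    }
}
```

Collecting failures into `missing` instead of filtering them out is what avoids the silent-drop behaviour the commit warns about.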
DKG and reconfig sessions now wait on the off-chain mpc_data freeze before instantiating. Honest validators that observe the chain event before the consensus-side freeze quorum lands park the request and retry on every subsequent batch cycle until the gate opens.
Gate conditions, evaluated against the per-epoch store:
- `NetworkEncryptionKeyDkg(key_id)` requires `is_mpc_data_frozen() && has_network_key_dkg_ready_quorum(key_id)`. Per-key quorum makes a stronger commitment than the epoch-wide signal: it certifies that this *specific* key has enough peers ready to actually participate.
- `NetworkEncryptionKeyReconfiguration(_)` requires only `is_mpc_data_frozen()`. Reconfig sweeps every key the validator knows about; a per-key gate would deadlock if the per-key quorum needed reconfig output for kickoff.
- Everything else (user DKG, presign, sign, etc.) is unaffected.
`AuthorityPerEpochStoreTrait` gains the two query methods `is_mpc_data_frozen` and `has_network_key_dkg_ready_quorum`, implemented concretely against `frozen_validator_mpc_data_input_set` and `network_key_dkg_ready_signals` respectively. The previously inherent-only `has_network_key_dkg_ready_quorum` is gone; it's now exclusively a trait method.
`TestingAuthorityPerEpochStore`'s impls return `Ok(true)` for both: integration tests don't drive the freeze flow end-to-end and would otherwise deadlock at the gate. Production builds use the real store where these reflect actual consensus-observed state.
In the manager, a new `requests_pending_for_frozen_mpc_data: Vec<DWalletSessionRequest>` queue mirrors the existing pending queues. Drained at the top of every `handle_mpc_request_batch` by re-running each request through `handle_mpc_request`. Requests that don't pass get re-queued; those that do proceed through the existing kickoff path.
Made `DWalletMPCManager.epoch_store` `pub(crate)` so the gate check in `mpc_session.rs` can reach it.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow`, 1 passed in 144.14s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
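The park-and-retry behaviour with the gate conditions as stated in this commit might be modelled as follows. `Manager`, `Request`, and the `dkg_ready_keys` set are simplified stand-ins for the real manager, session request type, and per-key quorum query.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
enum Request {
    NetworkDkg(u64),
    NetworkReconfiguration(u64),
    Sign(u64),
}

/// Hypothetical cut-down manager holding the gate inputs and the queues.
struct Manager {
    mpc_data_frozen: bool,
    dkg_ready_keys: HashSet<u64>, // keys whose per-key quorum was reached
    pending_for_frozen_mpc_data: Vec<Request>,
    started: Vec<Request>,
}

impl Manager {
    fn gate_open(&self, request: &Request) -> bool {
        match request {
            Request::NetworkDkg(key_id) => {
                self.mpc_data_frozen && self.dkg_ready_keys.contains(key_id)
            }
            Request::NetworkReconfiguration(_) => self.mpc_data_frozen,
            Request::Sign(_) => true, // other request kinds are not gated
        }
    }

    /// Starts the request if its gate is open, otherwise parks it.
    fn handle_request(&mut self, request: Request) {
        if self.gate_open(&request) {
            self.started.push(request);
        } else {
            self.pending_for_frozen_mpc_data.push(request);
        }
    }

    /// Drains the parked queue first, then processes the new batch.
    fn handle_batch(&mut self, batch: Vec<Request>) {
        let parked = std::mem::take(&mut self.pending_for_frozen_mpc_data);
        for request in parked.into_iter().chain(batch) {
            self.handle_request(request);
        }
    }
}
```

Taking the parked queue with `std::mem::take` before re-running it lets requests that still fail the gate be re-queued without looping forever within one batch.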
Adds the producer-side task without which the off-chain freeze quorum can never be reached, leaving step 14's kickoff gate permanently closed and stalling network DKG / reconfig. The new `MpcDataAnnouncementSender` (sibling of `EndOfPublishSender` under `sui_connector`) runs once per epoch per validator and: 1. Derives the canonical class-groups `mpc_data` blob from the validator's `RootSeed` (via `derive_mpc_data_blob` — identical bytes to what the CLI submits on chain). 2. Persists the blob into perpetual `mpc_artifact_blobs` so peers can fetch it by digest over the existing `GetMpcDataBlob` RPC. 3. Signs and submits a `ValidatorMpcDataAnnouncement` over consensus. Submission is idempotent — replays use the latest- by-timestamp rule. 4. After its own announcement is in, submits an `EpochMpcDataReadySignal` — one of two signal types whose quorum drives `freeze_mpc_data_if_first`. 5. Submits `NetworkKeyDKGReadySignal` for every known network key (deduped via a `HashSet`). Each of (3), (4), (5) is gated by its own one-shot flag plus ack-on-success, so a transient consensus-adapter failure causes a retry on the next tick (every 2s) rather than blowing up the task. Step-14 gate softened to match the design memo's "first quorum of either signal type freezes mpc_data" — DKG kickoff now only requires `is_mpc_data_frozen()`, same as reconfig. The per-key signal stays as an alternate freeze trigger but isn't a separate hard requirement, since the sui_syncer skips `AwaitingNetworkDKG` keys from the network-keys snapshot, meaning the producer task can't observe a fresh DKG-target key to signal for until *after* DKG completes — which would deadlock. Wired from `ika-node::monitor_reconfiguration` alongside `EndOfPublishSender`. `AuthorityState::perpetual_tables()` added to expose the perpetual handle without making the field public. The aborted-on-epoch-end pattern follows `end_of_publish_sender_handle`. 
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 143.64s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
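The "one-shot flag plus ack-on-success" gating described for steps (3), (4), (5) can be sketched roughly as follows. This is a minimal illustration, not the code in `MpcDataAnnouncementSender`; `OneShotStep` and `tick` are hypothetical names, and the real task is async and ticks every 2s.

```rust
/// Hypothetical stand-in for one producer step (announcement, ready signal, ...).
pub struct OneShotStep {
    done: bool,
}

impl OneShotStep {
    pub fn new() -> Self {
        Self { done: false }
    }

    /// Runs `submit` unless the step already succeeded. The flag is set only
    /// when `submit` returns Ok, so a failed attempt is retried on the next tick.
    /// Returns true once the step is complete.
    pub fn tick<E>(&mut self, submit: impl FnOnce() -> Result<(), E>) -> bool {
        if !self.done && submit().is_ok() {
            self.done = true;
        }
        self.done
    }
}
```

A failed submission leaves the flag unset, and a completed step never calls `submit` again, which is what keeps a transient adapter failure from either killing the task or causing duplicate submissions.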
Lights up step 6's joiner verify path by installing a
`StaticJoinerPubkeyProvider` on the current epoch store, sourced
from the next-epoch committee snapshot already kept live by
`sui_syncer::sync_next_committee` and exposed via
`next_epoch_committee_receiver`. Without this, every next-epoch
(joiner) `ValidatorMpcDataAnnouncement` drops silently because the
provider field is `None` by default.
The new per-epoch `JoinerPubkeyProviderUpdater` task watches the
receiver, computes the joiner set as `V_{e+1}.voting_rights`'s
authority names, and calls
`AuthorityPerEpochStore::install_joiner_pubkey_provider`. Since
`AuthorityName == AuthorityPublicKeyBytes`, the BLS sig verify in
`verify_joiner_announcement` runs against the announcer's claimed
authority directly — no separate pubkey lookup needed.
Idempotent: `last_installed` cache short-circuits re-installation
when the underlying set is byte-identical to the last one we
installed.
This is a *simplification* of the design memo's "verify against
PendingActiveSet" prescription: we wait until V_{e+1} is selected
on chain instead of reading `PendingActiveSet` directly. Trade-off
— joiners can't announce earlier than V_{e+1} selection, but
reading the `ExtendedField` for PendingActiveSet would require a
new Sui dynamic-field plumbing path that isn't justified for v1.
Early-announce can be added later if join-latency becomes a real
concern.
Spawned alongside the producer task in
`monitor_reconfiguration`; aborted on epoch end via the same
pattern as `end_of_publish_sender_handle`.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 271.18s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
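A rough sketch of the `last_installed` short-circuit described above, assuming the joiner set is compared as an ordered set of authority names. `JoinerUpdaterSketch`, `on_next_committee`, and the 4-byte `AuthorityName` are hypothetical stand-ins, not the real types.

```rust
use std::collections::BTreeSet;

/// Hypothetical short stand-in for the real authority public-key bytes.
pub type AuthorityName = [u8; 4];

pub struct JoinerUpdaterSketch {
    last_installed: Option<BTreeSet<AuthorityName>>,
    pub installs: usize,
}

impl JoinerUpdaterSketch {
    pub fn new() -> Self {
        Self { last_installed: None, installs: 0 }
    }

    /// Called whenever the next-epoch committee receiver changes.
    /// Returns true only if a new provider was actually installed.
    pub fn on_next_committee(&mut self, members: &[AuthorityName]) -> bool {
        let set: BTreeSet<AuthorityName> = members.iter().copied().collect();
        if self.last_installed.as_ref() == Some(&set) {
            return false; // identical set: skip re-installation
        }
        // The real task would call `install_joiner_pubkey_provider` here.
        self.installs += 1;
        self.last_installed = Some(set);
        true
    }
}
```

Using an ordered set makes the comparison independent of the order in which the committee lists its members.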
Closes the verify side of step 7's handoff loop. Without this, the `ConsensusPubkeyProvider` field stays `None` and every incoming `HandoffSignatureMessage` drops as `UnknownSigner`, meaning no peer's signature ever counts toward the aggregator's quorum and the cert never gets minted.
The new `ConsensusPubkeyProviderUpdater` task fetches the current committee's `StakingPool.validator_info.consensus_pubkey_bytes` directly via `sui_client.get_system_inner()` → `active_committee.members` → `get_validators_info_by_ids` → `verify().consensus_pubkey`. The result is mapped `AuthorityName -> Ed25519PublicKey` and installed as a `StaticConsensusPubkeyProvider` on the per-epoch store.
Cadence: 15s (consensus pubkey is fixed at validator registration and shouldn't change mid-epoch). Idempotent re-install via a base64-serialized cache key on the last installed map.
Sources the system inner directly rather than plumbing `system_object_receiver` out of `SuiSyncer`: one extra RPC every 15s is cheaper than the receiver-broadcast plumbing.
Wired in `monitor_reconfiguration` alongside the joiner-pubkey-provider updater and the producer task; aborted on epoch end via the same pattern as `end_of_publish_sender_handle`.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 209.13s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires step 12's overlay into the chain-read path. The syncer's `sync_dwallet_network_keys` task now applies `fetch_network_key_data_with_off_chain_blobs` to every chain copy before sending it on the watch channel, so consumers see locally-cached DKG / reconfig output blobs (populated by step 9's producer cache) instead of fetching them from chain on every re-read.
Plumbing:
- `SuiConnectorService` gains `network_key_blob_source: Arc<ArcSwapOption<Box<dyn NetworkKeyBlobSource>>>` plus an `install_network_key_blob_source` method.
- The handle is created (empty) at service construction and passed by clone into the syncer task, where `sync_dwallet_network_keys` reads it on each fetch tick.
- New adapter `EpochStoreBlobSource` wraps `Weak<AuthorityPerEpochStore>` so the long-lived service can hold a per-epoch reference; the weak upgrade returns `None` cleanly when the epoch ends, which makes the overlay fall back to the chain blob via `unwrap_or` on each field.
- `ika-node::monitor_reconfiguration` calls `sui_connector_service.install_network_key_blob_source(...)` once per epoch with a fresh `EpochStoreBlobSource` pointing at the new `cur_epoch_store`. Each install atomically replaces the previous epoch's source.
The lightweight metadata (id, current_epoch, dkg_at_epoch, state) always comes from chain; only the two large output blobs may be overlaid. When no source is installed, behavior is unchanged byte-for-byte.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 202.94s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
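The weak-reference fallback described for `EpochStoreBlobSource` can be sketched like this. It is a simplified illustration under assumed shapes: `EpochStoreSketch`, `BlobSourceSketch`, and `overlay` are hypothetical, and the real adapter goes through a trait object and `ArcSwapOption`.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical stand-in for the per-epoch store's cached output blob.
pub struct EpochStoreSketch {
    pub cached_output: Option<Vec<u8>>,
}

/// Long-lived holder of a per-epoch reference that must not keep the epoch alive.
pub struct BlobSourceSketch {
    store: Weak<EpochStoreSketch>,
}

impl BlobSourceSketch {
    pub fn new(store: &Arc<EpochStoreSketch>) -> Self {
        Self { store: Arc::downgrade(store) }
    }

    /// Returns None when the epoch store is gone or holds no cached blob.
    pub fn cached(&self) -> Option<Vec<u8>> {
        self.store.upgrade()?.cached_output.clone()
    }
}

/// Prefer the locally cached blob; otherwise keep the chain copy unchanged.
pub fn overlay(chain_blob: Vec<u8>, source: Option<&BlobSourceSketch>) -> Vec<u8> {
    source.and_then(|s| s.cached()).unwrap_or(chain_blob)
}
```

Once the epoch store is dropped, `upgrade()` yields `None` and the overlay degrades to the chain blob without any explicit teardown step.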
Wires step 13's pure assembler (`assemble_committee_class_groups_off_chain`) into the next-committee construction path. When the off-chain set covers every committee member, the resulting class-groups public-keys-and-proofs map comes straight from validators' own `mpc_data` announcements + the perpetual blob store instead of refetching from chain. `Incomplete` paths transparently fall through to the existing `get_mpc_data_from_validators_pool` read.
New abstractions in `validator_metadata`:
- `OffChainCommitteeClassGroupsSource` trait: single method `try_assemble_class_groups(&[AuthorityName]) -> OffChainClassGroupsAssembly`.
- `EpochStoreClassGroupsSource` adapter holds `Weak<AuthorityPerEpochStore>` (for the per-authority announcement digest lookup) + `Arc<AuthorityPerpetualTables>` (for the digest→bytes blob lookup), and delegates to the pure assembler. Returns `Incomplete` cleanly when the weak upgrade fails (epoch ended).
Plumbing:
- `SuiConnectorService` gains a second `Arc<ArcSwapOption<Box<dyn OffChainCommitteeClassGroupsSource>>>` handle with a matching `install_class_groups_source` setter.
- The handle is passed by clone into `SuiSyncer::run` and on to `sync_next_committee` → `new_committee`, where the off-chain attempt happens before the chain read.
- `ika-node::monitor_reconfiguration` installs a fresh `EpochStoreClassGroupsSource` once per epoch right next to the blob-source install. Each install atomically replaces the previous epoch's source.
Strict-gate rationale preserved: `new_committee` only short-circuits to the off-chain map on `Complete`. Any missing authority (joiner whose announcement hasn't been verified yet, blob not yet replicated, decode failure) falls through to chain, which is the only safe option since the load-bearing rule says reconfig MPC silently drops validators with no class-groups entry.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 265.04s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
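The strict `Complete`-or-fall-through rule might look roughly like this. It is a sketch with hypothetical names (`Assembly`, `try_assemble`, `class_groups_for`) and a `u8` standing in for `AuthorityName`; the real assembler also decodes and verifies blobs.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `OffChainClassGroupsAssembly`.
pub enum Assembly {
    Complete(HashMap<u8, Vec<u8>>),
    Incomplete,
}

/// Succeeds only if every committee member has an off-chain entry.
pub fn try_assemble(members: &[u8], off_chain: &HashMap<u8, Vec<u8>>) -> Assembly {
    let mut out = HashMap::new();
    for member in members {
        match off_chain.get(member) {
            Some(bytes) => {
                out.insert(*member, bytes.clone());
            }
            // One missing member invalidates the whole off-chain attempt.
            None => return Assembly::Incomplete,
        }
    }
    Assembly::Complete(out)
}

/// Off-chain first; any gap falls through to the (expensive) chain read.
pub fn class_groups_for(
    members: &[u8],
    off_chain: &HashMap<u8, Vec<u8>>,
    chain_read: impl FnOnce() -> HashMap<u8, Vec<u8>>,
) -> HashMap<u8, Vec<u8>> {
    match try_assemble(members, off_chain) {
        Assembly::Complete(map) => map,
        Assembly::Incomplete => chain_read(),
    }
}
```

There is deliberately no partial merge of off-chain and chain entries: a partial map would be indistinguishable from a committee with a silently dropped member.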
Wires the consumer side of step 5. The Anemo
`SubmitMpcDataAnnouncement` handler had been returning
`Rejected{"relay not installed"}` for every joiner submission;
this commit installs a concrete relay per epoch so the RPC
actually forwards joiner announcements into consensus.
The relay (`ConsensusBackedAnnouncementRelay` in
`sui_connector::announcement_relay`) runs three steps:
1. Cheap envelope checks — refuses unless
`announcement.epoch == next_epoch`, since current-epoch
announcements come from members who can submit themselves
directly.
2. Joiner verify via the pure
`validator_metadata::verify_joiner_announcement` against the
per-epoch store's installed `JoinerPubkeyProvider` (populated
by the joiner-provider syncer from step 6). Rejection here
stops a malicious peer from using us as a spam pipe.
3. Wraps in `ConsensusTransaction::new_validator_mpc_data_announcement`
and submits via the consensus adapter.
Plumbing:
- `P2pComponents` gains a `mpc_announcement_relay` field
(`Arc<AnnouncementRelayHandle>`) so the long-lived handle the
Anemo server already holds is also reachable from
`monitor_reconfiguration`.
- `IkaNode` stashes the same handle so the per-epoch install
loop can swap relays without re-touching the network layer.
- New `AuthorityPerEpochStore::joiner_pubkey_provider()` getter
exposes the installed provider for the relay's verify step
(mirrors the existing install/clear pair).
Install point: alongside the other per-epoch installs in
`monitor_reconfiguration`. Each epoch's relay holds
`Weak<AuthorityPerEpochStore>` so it naturally fails closed when
the epoch ends (returns "epoch ended" until the new epoch's
relay replaces it).
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 247.16s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
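The relay's three steps can be sketched as a single function. This is illustrative only: `Announcement`, `RelayOutcome`, and `relay_announcement` are hypothetical, the signature check is reduced to a boolean, and the rejection strings are made up.

```rust
#[derive(Debug, PartialEq)]
pub enum RelayOutcome {
    Accepted,
    Rejected(&'static str),
}

/// Hypothetical, heavily simplified announcement envelope.
pub struct Announcement {
    pub epoch: u64,
    pub signer: u8,
    pub signature_valid: bool, // stands in for the BLS verification result
}

pub fn relay_announcement(
    ann: &Announcement,
    current_epoch: u64,
    joiners: Option<&[u8]>,
    submit: impl FnOnce(&Announcement) -> bool,
) -> RelayOutcome {
    // Step 1: cheap envelope check, only next-epoch announcements are relayed.
    if ann.epoch != current_epoch + 1 {
        return RelayOutcome::Rejected("not a next-epoch announcement");
    }
    // Step 2: joiner verification against the installed provider.
    let Some(joiners) = joiners else {
        return RelayOutcome::Rejected("joiner provider not installed");
    };
    if !joiners.contains(&ann.signer) || !ann.signature_valid {
        return RelayOutcome::Rejected("joiner verification failed");
    }
    // Step 3: wrap and submit to consensus.
    if submit(ann) {
        RelayOutcome::Accepted
    } else {
        RelayOutcome::Rejected("consensus submit failed")
    }
}
```

Ordering the checks from cheapest to most expensive means a spammed relay spends almost nothing on announcements it will reject anyway.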
Reorganizes the four files that have no Sui RPC dependency and shouldn't have been under `sui_connector/`. They all just hold a `Weak<AuthorityPerEpochStore>` + an `Arc<dyn SubmitToConsensus>` and run as per-epoch background tasks that emit `ConsensusTransaction`s; that's a different responsibility from `sui_connector/` (which talks to Sui RPC).
Moved (identical bytes):
- `sui_connector/end_of_publish_sender.rs` → `epoch_tasks/end_of_publish_sender.rs`
- `sui_connector/mpc_data_announcement_sender.rs` → `epoch_tasks/mpc_data_announcement_sender.rs`
- `sui_connector/joiner_pubkey_provider_updater.rs` → `epoch_tasks/joiner_pubkey_provider_updater.rs`
- `sui_connector/announcement_relay.rs` → `epoch_tasks/announcement_relay.rs`
Kept in `sui_connector/`:
- `consensus_pubkey_provider_updater.rs`: actually calls `sui_client.get_system_inner()` + `get_validators_info_by_ids`, so it belongs with the Sui-side updaters.
The four moved files use only `crate::` paths internally so no import edits inside them; the only external rename is in `ika-node/src/lib.rs` (s/sui_connector/epoch_tasks/ on four call sites).
Module layout follows the CLAUDE.md `xxx.rs` convention: new `crates/ika-core/src/epoch_tasks.rs` declares the four submodules, files live in `epoch_tasks/`. No `mod.rs`.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 144.80s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three structural changes so the handoff loop is generic and not phrased as a validator-metadata feature:
1) Types extracted to `ika-types::handoff`. `HandoffItemKey`, `HandoffAttestation`, `HandoffSignatureMessage`, and `CertifiedHandoffAttestation` move out of `validator_metadata.rs`. `validator_metadata.rs` keeps only the four validator-specific types (`ValidatorMpcDataAnnouncement`, `SignedValidatorMpcDataAnnouncement`, `EpochMpcDataReadySignal`, `NetworkKeyDKGReadySignal`). Cross-crate import sites updated.
2) `HandoffSignatureSender` extracted from `EndOfPublishSender`. The latter shrinks back to "submit EndOfPublish on the local trigger" and nothing else. The new sender lives in `epoch_tasks/handoff_signature_sender.rs` and runs on the same `end_of_publish_receiver` independently. ika-node spawns both side-by-side and aborts both on epoch end.
3) `HandoffItemsBuilder` trait + concrete `MpcDataHandoffItemsBuilder`. Item contributors plug in via the trait; `AuthorityPerEpochStore::build_local_handoff_attestation` now takes `&[Arc<dyn HandoffItemsBuilder>]` and folds each contribution into the attestation. Today only the MPC-data builder is registered (via `default_handoff_items_builders`); new features (NOA, sui-state pinning, etc.) can append their own builder without touching the producer or aggregator.
`HandoffItemKey` stays a typed enum for now; moving to opaque byte keys was the fourth level I called out and explicitly deferred. Adding a new item kind still requires a variant bump, which is the right trade-off while the variant count is small.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 295.42s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
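A minimal sketch of how contributions from several builders could be folded into one item list. The trait method name `build_items`, the `FixedItems` builder, and the `u8` key are assumptions for illustration; sorting by key reflects the "sorted strictly ascending" item order stated for `HandoffAttestation`.

```rust
use std::sync::Arc;

/// Hypothetical simplified item: (key, 32-byte digest).
pub type Item = (u8, [u8; 32]);

/// Assumed shape of the contributor trait; the real method name may differ.
pub trait HandoffItemsBuilder {
    fn build_items(&self) -> Vec<Item>;
}

/// Test-style builder that returns a fixed contribution.
pub struct FixedItems(pub Vec<Item>);

impl HandoffItemsBuilder for FixedItems {
    fn build_items(&self) -> Vec<Item> {
        self.0.clone()
    }
}

/// Collects every builder's items and orders them by key so that all
/// validators derive the same byte-identical attestation body.
pub fn build_local_handoff_attestation(builders: &[Arc<dyn HandoffItemsBuilder>]) -> Vec<Item> {
    let mut items: Vec<Item> = builders.iter().flat_map(|b| b.build_items()).collect();
    items.sort_by_key(|(key, _)| *key);
    items
}
```

Because the producer only sees the trait, a new feature registers one more builder and the signing and aggregation code stays untouched.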
The module name `validator_metadata` was misleading — it bundled
three orthogonal P2P endpoints that have nothing to do with
"validator metadata" in the dictionary sense. Rename to
`mpc_artifacts` and split into purpose-named submodules:
- `mpc_artifacts/blob_store.rs` — content-addressed `mpc_data`
blob storage (`MpcDataBlobStorage`, `InMemoryBlobStore`,
`mpc_data_blob_hash`, `GetMpcDataBlobRequest`, `MpcDataBlob`,
`fetch_blob`).
- `mpc_artifacts/announcement_relay.rs` — joiner announcement
forwarding (`AnnouncementRelay`, `AnnouncementRelayHandle`,
`SubmitMpcDataAnnouncement{Request,Response}`,
`submit_announcement_to_peer`,
`submit_announcement_to_committee`).
- `mpc_artifacts/handoff_cert.rs` — handoff cert retrieval
(`HandoffCertStorage`, `GetCertifiedHandoffAttestationRequest`,
`fetch_certified_handoff_attestation`).
- `mpc_artifacts/server.rs` — Anemo `ValidatorMetadata` impl,
unchanged behavior (moved + import paths fixed).
- `mpc_artifacts.rs` — top-level module: `mod generated`,
submodule declarations, re-exports of every public surface so
external callers still write `ika_network::mpc_artifacts::X`
without caring which submodule X lives in, and the public
`build_server` constructor.
Anemo service wire name stays `ValidatorMetadata` (and the
codegen include stays `ika.ValidatorMetadata.rs`) — the
rename is internal-only, no protocol break. Tests for each
submodule moved next to their code (blob_store + relay tests).
External rename: `ika_network::validator_metadata` →
`ika_network::mpc_artifacts` across ika-core, ika-node, ika-types
inline paths, and ika-network's own build.rs request_type /
response_type paths.
Acceptance gate: `cargo test --release -p ika-core
test_network_dkg_full_flow` — 1 passed in 265.88s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a single `off_chain_validator_metadata` feature flag and bumps `MAX_PROTOCOL_VERSION` from 4 to 5; the flag flips on at v5. All off-chain pipeline hooks now check this flag and fall back to legacy chain-only behavior when false. The Sui-style protocol-version advance means every validator switches together at the exact consensus round the network advances to v5: no mixed-version freeze-quorum stalls, no asymmetric blob caches, no divergent handoff attestations.
Six gates, all failing closed to legacy:
1. Producer tasks self-exit on `run()` when the flag is false: `MpcDataAnnouncementSender`, `HandoffSignatureSender`, `JoinerPubkeyProviderUpdater`, `ConsensusPubkeyProviderUpdater`. Each reads `epoch_store.protocol_config().off_chain_validator_metadata_enabled()` once at task start.
2. ika-node `monitor_reconfiguration` reads the flag once per epoch and skips spawning the four tasks, the relay install, and the two `SuiConnectorService` source installs (`install_network_key_blob_source`, `install_class_groups_source`) when off, which saves the spawn churn even though the tasks self-gate. `EndOfPublishSender` stays unconditional since it's core-protocol.
3. Consumer record paths bail early when the flag is false. This is defensive, so a stray new-kind `ConsensusTransaction` from a peer can't allocate state: `record_validator_mpc_data_announcement`, `record_epoch_mpc_data_ready_signal`, `record_network_key_dkg_ready_signal`, `record_handoff_signature`.
4. Step-14 kickoff gate `off_chain_gate_passes` evaluates to `true` (legacy behavior) when the flag is off. Otherwise gates on `is_mpc_data_frozen()`. New trait method `off_chain_validator_metadata_enabled` on `AuthorityPerEpochStoreTrait` so the gate site can reach the flag through the trait object. `TestingAuthorityPerEpochStore` returns `true` to preserve existing integration-test behavior.
5. Step-9 producer cache hook in `DWalletMPCService::new_dwallet_mpc_output` skips when the flag is off, leaving the digest tables empty so the syncer overlay path naturally falls through to chain-only reads.
6. Syncer overlays (`sync_dwallet_network_keys`, `new_committee`) don't need explicit flag checks: when the flag is off, ika-node skips `install_*_source`, the source handles stay None inside `SuiConnectorService`, and the existing source-handle checks fall through to chain.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` — 1 passed in 313.64s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
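Gate 4 can be sketched as follows, using the version numbers stated in this commit (flag on at v5). `ProtocolConfigSketch` is a hypothetical stand-in for the real protocol config, which is generated by macro rather than written by hand.

```rust
/// Hypothetical stand-in for the real protocol config.
pub struct ProtocolConfigSketch {
    pub version: u64,
}

impl ProtocolConfigSketch {
    /// In this commit the flag flips on at protocol version 5.
    pub fn off_chain_validator_metadata_enabled(&self) -> bool {
        self.version >= 5
    }
}

/// Step-14 kickoff gate: legacy versions always pass; flagged versions
/// pass only once the mpc_data input set is frozen.
pub fn off_chain_gate_passes(cfg: &ProtocolConfigSketch, is_mpc_data_frozen: bool) -> bool {
    if !cfg.off_chain_validator_metadata_enabled() {
        return true; // fail closed to legacy chain-only behavior
    }
    is_mpc_data_frozen
}
```

Since the flag is derived purely from the protocol version, every validator evaluates the gate identically within an epoch.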
Brings PR #1 (cleanup, ika-benchmark removal), PR #2 (bootstrap library), PR #3 (ika-test-cluster), and the Inkrypto cryptography-private bump (post-PR-#1707 `ValidatorEncryptionKeysAndProofs` shape: class-groups + per-curve PVSS HPKE).
Merge resolutions:
* `authority_per_epoch_store.rs`: take origin/dev's tuple key `DBMap<(SessionIdentifier, u16), AssignedPresign>` for `assigned_presigns_schnorrkel_substrate` (PR #1707 fix) AND keep the seven off-chain metadata fields from this branch.
* `pnpm-lock.yaml`: keep upstream `sdk/signature-mpc-wasm/pkg: {}` entry; the stale stashed `sdk/ows/...` entries are already removed.
* `protocol-config/lib.rs`: keep `MAX_PROTOCOL_VERSION = 4`. Merge `network_encryption_key_version = Some(3)` and `reconfiguration_message_version = Some(3)` into the v4 arm so the Inkrypto crypto activates at the current MAX. The v5 arm (`noa_checkpoints = true`) is commented out as a forward-looking reference. Rewrote the version-history comment with one line per version. User's manual `internal_presign_sessions = false` at v4 preserved.
* Off-chain pipeline PVSS extension: the Inkrypto bump expanded `Committee::new` with three new PVSS HashMaps (secp256k1, secp256r1, ristretto). Extended `OffChainCommitteeClassGroupsSource` to assemble all four maps from the same blob bytes via the shape-tolerant `decode_validator_encryption_keys`. Validators publishing under mainnet-v1.1.8 shape contribute only class-groups; post-PR-#1707 validators contribute the full bundle, matching chain-fallback semantics in `sui_syncer::new_committee`.
* Test-only `Committee::new` call sites in `validator_metadata.rs`: pass three empty PVSS maps to satisfy the new 8-arg signature.
* Protocol-config snapshots regenerated for v3/v4 (off-chain flag flipped on at v4, crypto-v3 active at v4) plus v5 snap files kept on disk as forward-looking reference for the commented v5 arm.
Acceptance gate: `cargo test --release -p ika-core test_network_dkg_full_flow` passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make `request_add_validator_candidate`, `request_add_validator`, and `stake_ika` `pub` in `ika-swarm-config::sui_client` so the upcoming `IkaTestCluster` joiner helper can reuse the battle-tested PTB builders rather than duplicating them. No behavior change — same functions, broader visibility. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Scenario::with_base_dir keeps node logs after a failure (default temp dir is cleaned on drop, which hid the v1.1.8 boot panic). with_epoch_timeout for slower heterogeneous runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cross_binary test passes end-to-end (722s): 4 out-of-process validators boot on a v3-only binary, complete network DKG, then all swap to dev (v3..v4) and the capability vote advances v3 -> v4 at the next epoch. Exercises the protocol-vote arithmetic, mid-epoch reconfiguration across the swap, mixed-committee wire compat, and on-disk compat (restart on new binary, old RocksDB).
Tuning that made it pass: 10-min epochs and swap-all-then-one-transition, to avoid the known sui_executor gas-coin-contention epoch wedge (short epochs + swap churn froze the notifier's advance-epoch executor) and to keep each swap clear of the mid-epoch reconfiguration window. Scenario gains with_epoch_duration_ms / with_epoch_timeout.
The OLD binary is a dev build pinned to MAX_PROTOCOL_VERSION=3 (same crypto as dev, differs only in advertised version). The literal mainnet-v1.1.8 ika-node is crypto-incompatible (inkrypto vs cryptography-private class_groups; v4 key-shape change) and cannot share a committee with dev. This finding is documented in the test and confirms the dual-pin premise of docs/plan-update-crypto-latest.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
workload.rs: issue_dkg returns the txn digest (completion is confirmed via the
coordinator session counter, not a per-dWallet read); protocol-pp fetch retries
on a partial-TableVec decode error. tests/workload.rs (GREEN) proves the
submission path end-to-end: protocol params from the on-chain network key,
centralized Curve25519 party, coordinator txn executes.
KNOWN GAP (documented in the test + results doc): the coordinator ignores the
emitted event ("not a DWalletSessionEvent"), so the session does not complete —
the driver must call register_encryption_key before the DKG (as the TS SDK
does). Presign/Sign build on a completed DKG and are not implemented.
docs/cross-binary-upgrade-testing-results.md summarizes what was built, the
green go/no-go + cross-binary(v3->v4) runs, the v1.1.8 crypto-incompat finding,
the epoch-wedge tuning, and the workload gap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-chain tests/workload.rs passes (~415s): a user dWallet completes register-encryption-key -> DKG(Active) -> presign(verified) -> sign, the sign confirmed on-chain via the coordinator's user completed_sessions_count. Proves the session-lifecycle invariant (sessions started in an epoch complete; no silent drops). The driver orchestrates the canonical `ika` CLI (the tested Rust client) rather than re-deriving the user-side 2PC.
Making it reliably green surfaced real system properties, all handled:
- dedicated, separately-funded user (faucet SUI + IKA transfer), because sharing the publisher key with the notifier causes coin-lock contention;
- register-encryption-key before create (encrypted DKG borrows the user key from the coordinator);
- v4 genesis (internal_presign_sessions is a v4 feature);
- 30-min epoch so the lifecycle runs clear of the mid-epoch reconfiguration;
- confirm sign via on-chain completed-count, not the CLI's racy --wait poll.
Adds shared-crypto + fastcrypto deps for the IKA-funding transfer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…it, buffer-stake override
Hardening surfaced while driving a real mainnet-v1.1.8 -> dev rolling swap:
- process.rs: swap now stops the node with SIGTERM (which ika-node's `wait_termination` handles for an orderly shutdown) and waits for clean exit, with SIGKILL only as a fallback. The previous hard SIGKILL interrupted the node mid-consensus-round and left dwallet-MPC replay state partial on disk, which crashed the next binary on replay (`consensus round mismatch ...`). A binary swap is a planned restart, not a crash, so it must be graceful.
- cluster.rs: document that the epoch counter advancing to N is itself the completion signal for epoch N-1 (reconfiguration into a new epoch is gated on that epoch's network-key MPC finishing), so callers wait for the epoch *after* the work they depend on rather than polling key state.
- scenario.rs / process.rs: add a `set_buffer_stake(bps)` step (POST /set-override-buffer-stake) so a quorum, not unanimity, advances the protocol version. With n=4 the default 50% buffer rounds up to requiring all four votes; a rolling swap can leave one validator's fresh capability uncommitted at the boundary tally.
- cross_binary.rs: wait for epoch 2 before swapping (the genesis network DKG runs during epoch 1, so epoch 2 guarantees it finished under the old binary), drop the buffer stake to a quorum, then wait for epoch 3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
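The "50% buffer needs all four votes" arithmetic can be reproduced under one assumption: that the effective threshold follows the Sui-style rule `quorum + f * buffer_bps / 10000`, with total voting power 10000, quorum 6667, and equal stake per validator. The formula and constants below are that assumption, not values read from the ika code.

```rust
// Assumed Sui-style voting-power constants (not taken from the ika source).
pub const TOTAL_VOTING_POWER: u64 = 10_000;
pub const QUORUM_THRESHOLD: u64 = 6_667;

/// Assumed rule: quorum plus a fraction (in basis points) of the remaining stake.
pub fn effective_threshold(buffer_stake_bps: u64) -> u64 {
    let f = TOTAL_VOTING_POWER - QUORUM_THRESHOLD;
    QUORUM_THRESHOLD + f * buffer_stake_bps / 10_000
}

/// Number of equal-stake validators whose votes are needed to reach the threshold.
pub fn votes_needed(validators: u64, buffer_stake_bps: u64) -> u64 {
    let per_validator = TOTAL_VOTING_POWER / validators;
    effective_threshold(buffer_stake_bps).div_ceil(per_validator)
}
```

Under these assumptions, four validators hold 2500 each, so three votes give 7500: enough for a zero buffer (threshold 6667) but short of the 8333 required at 5000 bps.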
…t; gate internal-presign read on v4
Wire the cross-binary upgrade path against the off-chain-metadata branch so a mixed-version committee survives a rolling binary swap:
- verify_validator_keys decodes whatever class-groups key shape is on-chain (bare mainnet-v1.1.8 `ClassGroupsEncryptionKeyAndProof` or the post-bump combined `ValidatorEncryptionKeysAndProofs`) via `decode_validator_encryption_keys`, comparing only the class-groups component that identifies the seed. PVSS keys are verified off-chain on the assembly path.
- validator_initialization_config publishes the BARE mainnet-v1.1.8 shape on-chain (the richer bundle travels off-chain via validator P2P), so a v1.1.8 binary can still decode the record during the upgrade window.
- The internal-presign output read is gated on `internal_presign_sessions_enabled()` (a v4 feature) so a pre-v4 node mid rolling-upgrade skips the sparse stream instead of panicking on the dense per-round assertion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- bin default scenario `rolling_majority_then_minority` mirrors the proven-good config (10-min epochs + 1800s wait timeout + `set_buffer_stake(0)` before the upgrade-crossing wait) so the v3->v4 vote can land under n=4.
- `wait_for_epoch` logs a failed `current_epoch` read instead of silently treating it as epoch 0 until the deadline.
- Document that the workload sign-completion check (coordinator user `completed_sessions_count`) is sound only because the harness drives a single dedicated user with one sign in flight.
- results doc: caveat the cross_binary GREEN row (version-only swap; the real v1.1.8 crypto-boundary swap is not exercised) and note the single-instance / fixed-port (9000/9123) constraint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8672c96 to
f858d37
…te + read)
Same fix landed on dev via #1728, applied here on the off-chain-metadata file structure.
The consensus-output replay loop asserts each per-round table's record round equals the driver round (`dwallet_mpc_messages`). Tables added after mainnet-v1.1.8 are sparse when a dev binary replays a v1.1.8-written RocksDB (rolling binary swap), tripping the assertion. Gate write + read of each on the feature that introduced it:
- internal_presign_sessions: dwallet_internal_mpc_outputs, global_presign_requests, idle_status_updates
- noa_checkpoints: verified_system_checkpoint_messages, noa_observations, sui_chain_observation_updates
(`network_key_data_messages` is already removed on this branch by the off-chain work, so it needs no gate here.)
Validated by the cross-binary upgrade harness: mainnet-v1.1.8 -> dev rolling swap reaches protocol v4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#   crates/ika-core/src/authority/authority_per_epoch_store.rs
#   crates/ika-core/src/dwallet_mpc/dwallet_mpc_service.rs
…erified handoff cert
Replace the barrier's readiness condition 2. It previously read chain-derived fields off the `network_keys_receiver` overlay (`current_epoch >= next_epoch` plus a non-empty `current_reconfiguration_public_output`), which the `snapshot_ready_for_signing` gate deliberately avoids because the overlay can surface the prior epoch's output a round behind via the perpetual mirror, so a non-empty value there does not prove THIS epoch's reconfiguration is local.
Now the barrier decides readiness off the same off-chain signals everything else trusts: the verified `cur_epoch` handoff cert (the cross-epoch trust anchor) plus this validator's local reconfiguration-output digest slice. The cert's single `epoch` scopes the whole handoff, so there is no per-key epoch to check, only that every `NetworkReconfigurationOutput` item the cert certifies is held locally with a matching digest (`all_cert_reconfiguration_outputs_held_locally`). `prepare_handoff_anchor` now returns the cert so the caller reads its items directly, and the chain-fed `network_keys_receiver` dependency (and the seam blob-source pre-install that only existed to feed it) are dropped.
Also fix a wedge this exposed: holding the cert does NOT imply holding the outputs it certifies. A lagging validator can adopt the cert from a buffered peer-signature quorum without ever computing or caching those outputs, so the persisted-cert fast path now fetches + caches them too (idempotent). Otherwise a cert-but-no-outputs validator blocks at the barrier forever, never enters the epoch, and never publishes its mpc_data.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
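The readiness check named `all_cert_reconfiguration_outputs_held_locally` might reduce to something like the following. The key enum is simplified to carry a `u32` key id, and the local digest slice is modeled as a map; both are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `HandoffItemKey`; payloads reduced to a key id.
pub enum HandoffItemKey {
    NetworkDkgOutput(u32),
    NetworkReconfigurationOutput(u32),
    ValidatorMpcData(u32),
}

/// True when every reconfiguration-output item certified by the cert is held
/// locally with an identical digest. Other item kinds are not part of this check.
pub fn all_cert_reconfiguration_outputs_held_locally(
    cert_items: &[(HandoffItemKey, [u8; 32])],
    local_digests: &HashMap<u32, [u8; 32]>,
) -> bool {
    cert_items.iter().all(|(key, digest)| match key {
        HandoffItemKey::NetworkReconfigurationOutput(id) => local_digests.get(id) == Some(digest),
        _ => true,
    })
}
```

A missing or mismatching local digest keeps the barrier closed, which is why the persisted-cert fast path has to fetch and cache the outputs as well.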
…c_data set at the epoch boundary
A resource-slow validator would lag the epoch handoff and wedge reconfiguration in three independent ways. This fixes all three so the boundary converges on the full committee instead of locking at a bare quorum.
1. Persist the cert from observed signatures, not just from signing. `record_handoff_signature` buffered peer signatures until this validator computed its own attestation; a validator whose snapshot lagged never did, so it never persisted the cert and had to re-fetch its own prior-epoch cert at the next boundary. Now, once the buffered peer signatures show a stake-quorum agreeing on one attestation (`quorum_attestation_in_buffer`), adopt it and persist the cert from the observed quorum (replay re-verifies every signature, so a byzantine member can neither forge the cert nor block a real quorum).
2. Freeze the mpc_data input set only when a DKG/reconfiguration actually starts AND a quorum is present, not prematurely at epoch start. The freeze used to fire on the first ready-signal quorum, on a wall-clock deadline the long genesis-DKG transition had already consumed, locking the set at sub-full coverage before slower validators' mpc_data propagated. It now fires from the DKG/reconfiguration session gate (`freeze_mpc_data_if_quorum`), which a request reaches only after the next active committee is published (mid-epoch), by which point coverage is complete, so the frozen set holds every member.
3. Defer the epoch close a configurable number of consensus rounds past the EndOfPublish quorum so straggler `EndOfPublishV2` bundles (which carry handoff signatures) are sequenced before the epoch closes. The close (factored into `build_epoch_close_checkpoint_messages`) now fires at the commit boundary once every committee member has voted OR the leader round has advanced `end_of_publish_grace_rounds` (new protocol constant, default 50) past quorum. Measured as a leader-round delta (rounds skip, so this is not +1 per commit), and the anchor round is persisted so a validator restarting mid-grace closes at the same round as its peers (the final checkpoint must be deterministic).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
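The deferred-close rule in item 3 can be sketched as a pure predicate. `should_close_epoch` and its parameters are hypothetical names; the only values taken from the commit are the "all voted OR grace rounds elapsed" condition and the default of 50.

```rust
/// Decides at a commit boundary whether the epoch may close.
/// `quorum_anchor_round` is the persisted leader round at which the
/// EndOfPublish quorum was first observed (None before quorum).
pub fn should_close_epoch(
    votes: usize,
    committee_size: usize,
    leader_round: u64,
    quorum_anchor_round: Option<u64>,
    grace_rounds: u64,
) -> bool {
    let Some(anchor) = quorum_anchor_round else {
        return false; // no quorum yet, nothing to close
    };
    // Close early on unanimity; otherwise wait out the grace window,
    // measured as a round delta because leader rounds can skip.
    votes == committee_size || leader_round.saturating_sub(anchor) >= grace_rounds
}
```

Keeping the decision a function of persisted state only (vote count, anchor round, current leader round) is what lets a validator that restarts mid-grace reach the same close round as its peers.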
… feat/ika-upgrade-test
…ore driving
The workload waited for epoch 1 (genesis, reached immediately) and drove the dWallet lifecycle right away, with a 30-minute epoch. At v4 the genesis network-key DKG is gated on the off-chain mpc_data freeze, whose ready-signal (with no next-epoch committee published yet at genesis) only fires at the 3/4*epoch_duration deadline (~22 min on a 30-min epoch). So the network key wasn't on-chain when the CLI tried to derive protocol parameters, and the driver's ~2-min retry budget gave up long before.
Wait for epoch 2 instead of 1: the counter advancing to 2 is itself the completion signal for the genesis network DKG (reconfiguration into epoch 2 reshares that key, which can't happen until the DKG finished), so the key is guaranteed readable. Don't drive the lifecycle before then, when it could only fail. Shorten the epoch to 4 min so the freeze deadline (~3 min) is reachable while still clearing validator bring-up + announcement recording (~90s) and leaving the lifecycle room inside the next epoch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
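The deadline arithmetic quoted above (3/4 of the epoch duration) works out as follows. `freeze_deadline` is a hypothetical helper used only to check the two figures in the message.

```rust
use std::time::Duration;

/// Ready-signal deadline at three quarters of the epoch, as described in the commit.
pub fn freeze_deadline(epoch_duration: Duration) -> Duration {
    epoch_duration * 3 / 4
}
```

For a 30-minute epoch this gives 22.5 minutes, far beyond a ~2-minute retry budget; for a 4-minute epoch it gives 3 minutes, which still leaves time after the ~90s of validator bring-up.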
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… mid-epoch at the v3->v4 boundary
The off-chain network-key blob overlay is keyed by key ID only, so the moment this epoch's mid-epoch reconfiguration finalizes locally, the syncer's merged key data starts carrying the output produced *for the next epoch's committee*: shares encrypted to next-epoch party IDs, which need not align with this epoch's (on-chain committee order is not stable across epochs). In steady-state v4 the cert anchor in `adopt_cert_verified_keys` rejects it (the prior epoch's handoff cert pins the output produced FOR the current epoch), but the first v4 epoch after the v3->v4 upgrade has no prior cert, fell into the cert-less boundary path, and adopted blindly. Every validator then failed decryption with ClassGroup(Decryption) using this epoch's identity on next-epoch-dealt shares.
Guard the boundary path the same way the cert anchor does: skip adoption when the reconfiguration output's digest matches the one this epoch's own reconfiguration session recorded (epoch-keyed perpetual digest, new point lookup). The next epoch's manager adopts and decrypts it with next-epoch identity at epoch start, as in steady state.
Also hoist the last-failed check in the instantiation filter so it applies to every branch: previously the `Some(prev)` branch re-selected the failing bytes every poll tick (they differ from the last *successfully* instantiated ones by definition), re-running a doomed ~18s class-groups decrypt per tick and starving the service loop. Checkpoints (including EndOfPublish) never certified, wedging epoch advance behind the decryption failure.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ejections The stale-gas recovery (drop the cached gas ref, floor the re-fetch at the rejected version) only ran when the rejection arrived inside `tx_response.errors`. But the fullnode also rejects at the JSON-RPC layer (ServerError -32002, "Transaction needs to be rebuilt ... object unavailable for consumption"), which surfaces as `Err` from `execute_transaction_block_with_effects` and bailed out before the recovery code — the cached gas ref survived, so every `retry_with_max_elapsed_time!` attempt rebuilt the byte-identical stale tx and re-rejected, wedging checkpoint delivery to Sui for the full one-hour window (observed: dwallet checkpoints stuck behind a gas coin advanced by the shared publisher address in the test cluster, blocking DKG settlement, mid-epoch reconfiguration, and epoch advance). Factor the recovery into `NotifierSubmitState::handle_possible_stale_gas_rejection` and apply it on both paths. Note `IkaError` derives strum's `AsRefStr`, so `err.as_ref()` yields only the variant name — match the `SuiClientTxFailureGeneric` payload to get the actual message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grade boundary Genesis at v3 (MIN) instead of v4 (MAX): a v4 *genesis* network DKG is rejected forever (PVSS keys only arrive through the next-committee-only off-chain assembly), so the supported path — and the one mainnet takes — is genesis v3, then upgrade into v4 via the capability vote. The test now waits for epoch 2 (v3 genesis DKG + reshare done), zeroes the buffer stake so the 4-validator vote tallies at bare quorum, waits for epoch 3, asserts protocol >= v4, and only then drives the DKG -> Presign -> Sign lifecycle. This exercises the cert-less v3->v4 reconfiguration-adoption boundary fixed in the previous commits. Remove HANDOFF.md — the reshare-decrypt bug it described is fixed and the workload test is green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-gas recovery Clippy's `unnecessary_to_owned` suggests `err.as_ref()` for `&err.to_string()` here; it compiles because `IkaError` derives strum's `AsRefStr`, but that returns only the variant name — never the rejection markers — silently disabling the recovery. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
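A minimal sketch of the pitfall described above, assuming a stand-in enum rather than the real `IkaError`. The `AsRef<str>` impl is written by hand to mimic what the commit says strum's `AsRefStr` derive yields (the variant name only); `DemoError`, `STALE_MARKER`, and both helper functions are hypothetical names for illustration.

```rust
use std::fmt;

/// Hypothetical stand-in for an error enum; not the real `IkaError`.
enum DemoError {
    TxFailureGeneric(String),
}

/// Hand-written equivalent of a variant-name-only `AsRef<str>`:
/// the payload is dropped entirely.
impl AsRef<str> for DemoError {
    fn as_ref(&self) -> &str {
        match self {
            DemoError::TxFailureGeneric(_) => "TxFailureGeneric",
        }
    }
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::TxFailureGeneric(msg) => write!(f, "tx failure: {msg}"),
        }
    }
}

/// Marker taken from the rejection text quoted in the earlier commit message.
const STALE_MARKER: &str = "needs to be rebuilt";

/// Looks only at the variant name, so it can never see the marker.
fn is_stale_by_variant_name(err: &DemoError) -> bool {
    let name: &str = err.as_ref();
    name.contains(STALE_MARKER)
}

/// Matches the payload, which is where the rejection message lives.
fn is_stale_by_payload(err: &DemoError) -> bool {
    match err {
        DemoError::TxFailureGeneric(msg) => msg.contains(STALE_MARKER),
    }
}
```

With an error whose payload carries the rejection text, the first check returns false and the second returns true, which is the silent failure mode the commit guards against.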
…dators
Internal presign sessions get their sequence number from a single shared counter, assigned in iteration order over (network key id) x (curve) x (signature algorithm), and the sequence number is bound into the session identifier transcript. Both iteration sources were unordered:
- SUPPORTED_CURVES_TO_SIGNATURE_ALGORITHMS_TO_HASH_SCHEMES was a HashMap<u32, HashMap<u32, Vec<u32>>>: iteration order is random per process (RandomState), so each validator walked curves/algorithms in a different order;
- the agreed network key ids were iterated straight off a HashMap.
Each validator therefore derived *different* session identifiers for the same (curve, algorithm) work. Those sessions could never reach quorum, so they never completed, and the instantiated != completed gate then blocked that algorithm's pool top-ups for the entire epoch. Once a user presign request locked onto the starved pool, the EndOfPublish condition was unsatisfiable and the epoch could not advance.
Observed live: in a 4-validator run the validators logged three distinct top-up orders, and exactly the sequence numbers whose (curve, algorithm) assignment happened to agree on 3+ validators completed. The rest hung forever, the ECDSA pool stayed empty all epoch, and the run timed out. A previous green run was a per-process-seed coin flip.
Fix: BTreeMap at both nesting levels of the static, and collect the agreed key ids into a BTreeSet before the instantiation loop.
Pre-existing bug from the internal sessions instantiation logic (#1638), not specific to this branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
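The fix can be sketched as follows: with ordered containers at every level, each process walks (key id) x (curve) x (algorithm) in the same order and so assigns the same sequence numbers. The `Supported` alias, the `u64` key ids, and `assign_sequence_numbers` are assumptions for illustration; the real static and key-id types differ.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Curve id -> signature algorithm id -> hash scheme ids.
/// Hypothetical shape mirroring the static described above.
type Supported = BTreeMap<u32, BTreeMap<u32, Vec<u32>>>;

/// Assigns sequence numbers from one shared counter, in iteration order over
/// (key id) x (curve) x (algorithm). Because BTreeSet and BTreeMap iterate in
/// sorted key order, the result does not depend on insertion order or on any
/// per-process hash seed.
fn assign_sequence_numbers(
    key_ids: &BTreeSet<u64>,
    supported: &Supported,
) -> Vec<(u64, u64, u32, u32)> {
    let mut next_sequence_number = 0u64;
    let mut assignments = Vec::new();
    for key_id in key_ids {
        for (curve, algorithms) in supported {
            for algorithm in algorithms.keys() {
                assignments.push((next_sequence_number, *key_id, *curve, *algorithm));
                next_sequence_number += 1;
            }
        }
    }
    assignments
}
```

Two validators that build the same sets in different insertion orders get identical assignments, which is the property the session identifier transcript needs.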
The reconfiguration overlay (next-epoch network key data computed off-chain during reconfiguration) was stored bare and adopted based on a produced-this-epoch digest guard. Instead, store the target epoch alongside the key data and adopt it only when it matches the epoch actually being entered: epoch-correct by construction, no guard heuristics.
- validator_metadata.rs: overlay entries carry the epoch they were computed for; lookups take the target epoch.
- authority_per_epoch_store.rs / authority_perpetual_tables.rs: persist and reload the epoch alongside the overlay data; drop the digest-guard plumbing.
- mpc_manager.rs / sui_syncer.rs: pass the target epoch through adoption and ignore overlay data for any other epoch.
Validated by a full workload run: genesis at v3, upgrade into v4 at epoch 3, v3 -> v4 reshare decrypts cleanly on all validators, DKG -> Presign -> Sign lifecycle green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
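A minimal sketch of the epoch-tagged overlay idea, assuming hypothetical `Overlay` and `OverlayEntry` types with `u64` key ids and opaque byte payloads; the real storage and types in the branch are not shown in this log.

```rust
use std::collections::HashMap;

/// Hypothetical overlay entry: key data plus the epoch it was computed for.
struct OverlayEntry {
    target_epoch: u64,
    key_data: Vec<u8>,
}

/// Hypothetical overlay keyed by network key id.
#[derive(Default)]
struct Overlay {
    by_key_id: HashMap<u64, OverlayEntry>,
}

impl Overlay {
    /// Stores key data together with the epoch it was produced for.
    fn record(&mut self, key_id: u64, target_epoch: u64, key_data: Vec<u8>) {
        self.by_key_id
            .insert(key_id, OverlayEntry { target_epoch, key_data });
    }

    /// Returns the key data only when it was computed for the epoch being
    /// entered; data tagged for any other epoch is ignored.
    fn adopt(&self, key_id: u64, entering_epoch: u64) -> Option<&[u8]> {
        self.by_key_id
            .get(&key_id)
            .filter(|entry| entry.target_epoch == entering_epoch)
            .map(|entry| entry.key_data.as_slice())
    }
}
```

The lookup takes the target epoch as an argument, so a mid-epoch reader in epoch N cannot accidentally pick up output dealt to the epoch N+1 committee.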
Two upgrades to the cross-binary rolling-upgrade test:
1. Committee changes at every epoch boundary, with a different committee
size each epoch (4 -> 3 -> 5 -> 4): a validator removal coincides with
the v3 -> v4 protocol bump, two brand-new validators join via the full
candidate -> stake -> activate flow (their class-groups keys registered
on-chain, so the v4 reshare encrypts to parties that never held the
key), and a final removal reshapes 5 -> 4. Every boundary after the
first is a real reshare to a different party set.
- sui_client.rs: request_add_validator / request_remove_validator /
candidate registration helpers with explicit sender, shared version,
and cap so the test can drive membership without touching the active
wallet address.
- network_config_builder.rs: configurable min_validator_count (the dip
to 3 is below the protocol default of 4).
- scenario.rs / cluster.rs / process.rs: join_validator,
remove_validator, expect_committee_size scenario steps; spawn /
stop / swap of individual validators on different binaries.
2. Rough per-protocol MPC timing report (mpc_timings.rs): scrape the MPC
duration metrics from each validator after the v3 (old binary) and v4
(new binary, churned committee) workload runs, and print a comparison
table at the end of the run. Informational and flagged, not asserted —
wall-clock on a loaded developer machine is too noisy to gate on.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…set post-upgrade The cross-binary churn test wedged at the v3 workload: genesis wrote the full GlobalPresignConfig, routing ECDSA presigns to the global path, which is served exclusively from the validators' internal presign pool — and that pool only fills once internal_presign_sessions activates at protocol v4. At v3 the presign was unservable forever. Genesis now takes GenesisGlobalPresignConfig (Full | Empty). Empty is the mainnet-v1.1.8 on-chain state (the config object must still exist — the coordinator reads it with a bare dynamic-field borrow). The cross-binary scenario uses Empty at genesis and a new SetGlobalPresignConfig step right after the v4 upgrade is confirmed — the same operational ordering a real mainnet rollout must follow: set_global_presign_config only after v4 activates, or ECDSA presigns stall network-wide until it does. Existing genesis-at-v4 tests keep Full (exact current behavior). Also rewords the cross_binary doc comment: the literal v1.1.8 binary failing on harness genesis is a registration-shape artifact (post-#1707 bundle bytes), not a production-direction gap — the new binary reads v1.1.8 keys via the shape-tolerant decode. Verified in the churn run: v3 workload completed (vs infinite wedge), v3→v4 vote passed, post-upgrade config set succeeded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le coin
transfer_one_ika took the publisher's first IKA coin — the genesis supply
coin (ika_supply_id) — and transferred it whole to the workload user. The
first churn run to stake a joiner after a workload exposed it: stake_ika
splits the joining stake from ika_supply_id signed by the publisher, which
no longer owned it ("Transaction was not signed by the correct sender").
A second workload on the same cluster would have failed the same way
("publisher owns no IKA").
Pay a fixed 1000-IKA allowance to the workload user instead; the supply
coin stays with the publisher.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s activates
The global-presign pipeline is gated on the internal_presign_sessions feature flag (protocol v4) on its consensus side, but session intake diverted every global presign request to the pool unconditionally. Below v4 that strands the request: no pool to serve it, no MPC session spawned, and the session is locked into its epoch. all_current_epoch_sessions_completed blocks advance_epoch, so the epoch can never end and v4 can never activate. Mainnet's GlobalPresignConfig is already populated (every production ECDSA presign routes to global), so a single presign request in flight after the upgrade restart would have wedged the network at v3 permanently.
Gate the diversion on the same flag: pre-activation, the request falls through to a user-requested MPC session. This is the v1.1.8 serving behavior, whose input (dwallet-output-less presign computation) and output (RespondDWalletPresign with no dwallet_id, VersionedPresignOutput::V2) paths are intact on this branch.
Caught by the new v118_upgrade rehearsal: genesis a 4-validator committee on the literal mainnet-v1.1.8 ika-node with the verified mainnet-shape populated GlobalPresignConfig, run the mainnet user flow at v3 (DKG with Universal output, global presign as a user session, sign), atomically swap all validators to the local build, and probe the pre-activation window with a workload that must complete its global presign at v3 via the fallback before the boundary. The run then crosses into v4 (the local binaries reshare the 1.1.8-created network key), serves a pool-backed global presign, and completes one more clean reshare.
Also corrects GenesisGlobalPresignConfig and cross_binary docs: Full is the actual verified mainnet on-chain state, Empty is a harness arrangement (and the only targeted-presign coverage), not the mainnet shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
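The gating described above reduces to a small routing decision. This is a sketch under assumptions: `PresignRoute` and `route_presign` are hypothetical names, and the rule that targeted (non-global) presigns always run as user sessions is assumed here rather than stated in the commit.

```rust
/// Hypothetical routing outcome for a presign request.
#[derive(Debug, PartialEq)]
enum PresignRoute {
    /// Served from the validators' internal presign pool.
    InternalPool,
    /// Served by spawning a user-requested MPC session.
    UserSession,
}

/// Diverts a global presign to the pool only once the feature flag is active.
/// Before activation it falls through to a user session, so the request is
/// never stranded waiting on a pool that does not exist yet.
fn route_presign(is_global: bool, internal_presign_sessions_enabled: bool) -> PresignRoute {
    if is_global && internal_presign_sessions_enabled {
        PresignRoute::InternalPool
    } else {
        PresignRoute::UserSession
    }
}
```

The point of the design is that intake and the consensus side consult the same flag, so there is no protocol version at which a request is routed to a path nothing serves.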
…ith window-delta tables
The scraped MPC duration metrics are cumulative per process, so a later snapshot blends everything the validator ever ran: a v3-protocol reshare and a v4-protocol reshare land in one row and the ratio table reads 1.00x. Add a window table to each comparison: (sum2-sum1)/(completions2-completions1) between consecutive snapshots isolates just the work done between them (skipped across a swap, where the counter reset makes the delta negative).
Extend the v118 rehearsal past epoch 4 with two new snapshots:
- v4-reshare: the first reshare executed under the v4 reconfiguration math (reconfiguration_message_version = 3, PVSS HPKE). The epoch 2->3 reshare still ran the v3 protocol, so the previous run never measured v4 reconfiguration cleanly;
- local-v4-settled: a full lifecycle after the pools finished their initial fill, pricing v4 DKG / pool presign / sign without the boundary work.
Run is green (1105s): the v4-math reshare window prices at 53.2s/7.4s/30.8s/9.6s per round vs the local binary's v3-math reshare at 9.5s/2.5s/8.6s/2.8s, with continuous internal-presign pool top-ups sharing the cores.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
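The window computation above can be sketched as a function over two consecutive cumulative snapshots. `Snapshot` and `window_mean` are hypothetical names; the sketch also treats an empty window (no new completions) as skipped, which is an assumption added to avoid dividing by zero.

```rust
/// Hypothetical cumulative snapshot of one MPC duration metric.
#[derive(Clone, Copy)]
struct Snapshot {
    /// Total seconds spent, summed since process start.
    sum_secs: f64,
    /// Total completions counted since process start.
    completions: u64,
}

/// Mean duration of the work done between two consecutive snapshots:
/// (sum2 - sum1) / (completions2 - completions1).
/// Returns None when the counters went backwards (a binary swap reset them)
/// or when nothing completed inside the window.
fn window_mean(prev: Snapshot, next: Snapshot) -> Option<f64> {
    if next.completions <= prev.completions || next.sum_secs < prev.sum_secs {
        return None;
    }
    let completed = (next.completions - prev.completions) as f64;
    Some((next.sum_secs - prev.sum_secs) / completed)
}
```

For example, going from 10 s over 2 completions to 40 s over 5 completions yields a window mean of 10 s, whereas the cumulative mean at the second snapshot would be 8 s and would hide the slower recent work.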
# Conflicts: # crates/ika-core/src/dwallet_mpc/mpc_manager.rs
Summary

New additive crate `crates/ika-upgrade-test`: an out-of-process test harness that spawns real, separately-compiled `ika-validator` binaries against an external `sui` localnet, swaps binaries on validators across epochs, and drives dWallet workloads. Unlike `ika-test-cluster` (in-process `IkaNode`, one binary), it can host genuinely different binaries in one committee. No changes to `ika-node` / `ika-swarm`.

Implements `docs/cross-binary-upgrade-testing*.md`; see `docs/cross-binary-upgrade-testing-results.md` for the full write-up.

Tests (all opt-in via env flags; need real binaries + a workspace-tag `sui`)

- `tests/smoke.rs` (go/no-go)
- `tests/cross_binary.rs`
- `tests/workload.rs`

The cross-binary run exercises, out of process: protocol-vote arithmetic, mid-epoch reconfiguration MPC across the swap, mixed-committee wire compat, and on-disk compat (restart on a new binary against the old RocksDB).

Crate layout

- `sui.rs`: spawn external `sui start --with-faucet --force-regenesis` (waits for RPC and faucet).
- `cluster.rs`: chain bootstrap via `init_ika_on_sui` + `ValidatorConfigBuilder` + a notifier fullnode; `NodeConfig` → YAML → child; on-chain `wait_for_epoch` / protocol-version via `IkaClient`.
- `process.rs`: `ValidatorProcess`: spawn / stop / `swap_binary`, health via the admin RPC.
- `binary.rs`: `BinarySpec` (path / tag / sha / branch) + a sha-keyed `git worktree` build cache honoring each commit's pinned toolchain.
- `scenario.rs`: imperative DSL (start / wait_for_epoch / stop_and_swap / expect_protocol_version).
- `workload.rs`: drives a user dWallet lifecycle by orchestrating the canonical `ika` CLI; confirms completion on-chain.

Key finding: `mainnet-v1.1.8` ↔ `dev` is not a naive rolling swap

Running the harness with the real `mainnet-v1.1.8` `ika-node` fails at boot for the expected reason, not a harness bug: v1.1.8 links `class_groups` from `dwallet-labs/inkrypto`, dev from `dwallet-labs/cryptography-private`, and v4 publishes the combined `ValidatorEncryptionKeysAndProofs` where v1.1.8 expects the bare `ClassGroupsEncryptionKeyAndProof`. This is the exact incompatibility flagged in `validator_initialization_config.rs` (⚠️ MAINNET WIRE-FORMAT INCOMPATIBILITY ⚠️). The v1.1.8 node loads its config, connects to Sui, reads the contracts, then fails decoding the on-chain validator record (class groups public key … remaining input). So the harness genuinely runs a different binary and fails on the documented wire-format divergence.

To demonstrate a successful heterogeneous upgrade, the green `cross_binary` run uses an OLD binary that is a `dev` build pinned to `MAX_PROTOCOL_VERSION=3` (same crypto, differs only in advertised protocol version), as disclosed in the test and results doc.

Notes

- `--no-default-features` to drop `enforce-minimum-cpu` (panics on < 16-core hosts).
- `register-encryption-key` precedes `create`; v4 genesis for `internal_presign_sessions`; long epoch to clear the mid-epoch reconfiguration; sign confirmed via the coordinator's on-chain completed-session count.

🤖 Generated with Claude Code